Abstract


Multithreading has become a dominant paradigm in general purpose MIMD parallel computa- 
tion. To execute a multithreaded computation on a parallel computer, a scheduler must order 
and allocate threads to run on the individual processors. The scheduling algorithm dramatically 
affects both the speedup attained and the space used when executing the computation. We con- 
sider the problem of scheduling multithreaded computations to achieve linear speedup without 
using significantly more space-per-processor than required for a single-processor execution. 

We show that for general multithreaded computations, no scheduling algorithm can
simultaneously make efficient use of space and time. In particular, we show that there exist
multithreaded computations such that any execution schedule $\mathcal{X}$ that achieves
$P$-processor execution time $T_P(\mathcal{X}) \le T_1/\rho$, where $T_1$ is the minimum possible
serial execution time, must use space at least
$S_P(\mathcal{X}) \ge \frac{1}{4}(\rho - 1)\sqrt{T_1} + S_1$, where $S_1$ is the space used by an
efficient serial execution. For such a computation, even achieving a factor of 2 speedup
($\rho = 2$) requires space proportional to the square root of the serial execution time.

By restricting ourselves to a class of computations we call strict computations, however, 
we show that there exist schedulers that can provide both efficient speedup and use of space. 
Specifically, we show that for any strict multithreaded computation and any number $P$ of
processors, there exists an execution schedule $\mathcal{X}$ that achieves time
$T_P(\mathcal{X}) \le T_1/P + T_\infty$, where $T_\infty$ is a lower bound on execution time even
for arbitrarily large numbers of processors, and space $S_P(\mathcal{X}) \le S_1 P$. We
demonstrate such schedules by exhibiting a simple centralized algorithm to compute them. We
give a second, somewhat more efficient, algorithm that computes
equally good execution schedules; this algorithm is online and should be practical for moderate 
numbers of processors, but its use of a centralized queue makes it inefficient for large numbers 
of processors. 

To demonstrate an algorithm that is efficient even for large machines, we give a randomized,
distributed, and online scheduling algorithm that computes an execution schedule $\mathcal{X}$
that achieves guaranteed space $S_P(\mathcal{X}) = O(S_1 P \lg P)$ and expected time
$E[T_P(\mathcal{X})] = O(T_1/P + T_\infty \lg P)$. Though this algorithm uses a $\lg P$ factor
more space than the centralized algorithm, it can still achieve linear expected speedup, that
is, $E[T_P(\mathcal{X})] = O(T_1/P)$, provided the computation has sufficient average available
parallelism, that is, $T_1/T_\infty = \Omega(P \lg P)$. Furthermore, this algorithm is efficient
in that on a PRAM or various low-latency, high-bandwidth
fixed-connection networks, the overhead in computing the schedule is only a constant fraction 
of the execution time. 

We also show that some nonstrictness can be allowed in an otherwise strict computation in a 
way that may improve performance, but does not adversely affect the time and space bounds. 

Chapter 1


Introduction 


In the course of investigating schemes for general purpose MIMD parallel computation, many 
diverse research groups have converged on multithreading as a dominant paradigm. As an ex- 
ample, modern dataflow systems [9, 11, 16, 24, 25, 26, 31, 32] partition the dataflow instructions 
into fixed groups called threads and arrange the instructions of each thread into a fixed sequen- 
tial order at compile time. At run time, a scheduler employs dataflow concepts to dynamically 
order execution of the threads. Other systems have schedulers that dynamically order threads 
based on the availability of data in shared memory multiprocessors [1, 4, 13] or on the arrival 
of messages in message-passing multicomputers [2, 10, 20, 35]. 

Rapid execution of a multithreaded computation on a parallel computer requires exposing 
and exploiting parallelism in the computation by keeping enough threads concurrently active 
to keep the processors of the computer busy. If processors are busy most of the time, the
execution schedule $\mathcal{X}$ of the computation exhibits linear speedup: the running time
$T_P(\mathcal{X})$ with $P$ processors is order $P$ times faster than the optimal running time
$T_1$ with 1 processor, that is, $T_P(\mathcal{X}) = O(T_1/P)$.

In attempting to expose parallelism, however, schedulers often end up exposing more par- 
allelism than the computer can actually exploit, and since each active thread requires the use 
of a certain amount of memory, such schedulers can easily overrun the memory capacity of the 
machine [8, 12, 14, 30, 34]. To date, the space requirements of multithreaded computations 
have been managed with heuristics or not at all [7, 8, 12, 14, 17, 23, 30, 34]. In this thesis,
we use algorithmic techniques to address the problem of managing storage for multithreaded
computations. Our goal is to develop scheduling algorithms that expose sufficient parallelism
to obtain linear speedup, but without exposing so much parallelism that the space requirements 
become excessive. 

We compare the amount of space $S_P(\mathcal{X})$ required by a $P$-processor execution schedule
for a multithreaded computation with the space $S_1$ used by a space-optimal 1-processor
execution. We wish to use as little space as possible, and we argue that a space-efficient
execution schedule exhibits linear expansion of space, that is,
$S_P(\mathcal{X}) = O(S_1 \cdot P)$.

Our first result shows that in general, it is not possible to achieve both linear speedup and 
linear expansion of space. We exhibit a multithreaded computation such that any execution 
schedule $\mathcal{X}$ that achieves $P$-processor execution time
$T_P(\mathcal{X}) \le T_1/\rho$ must use space at least
$S_P(\mathcal{X}) \ge \frac{1}{4}(\rho - 1)\sqrt{T_1} + S_1$. For such a computation, even
achieving a factor of 2 speedup ($\rho = 2$) requires space proportional to the square root of
the serial execution time.

In order to cope with this negative result, we restrict our attention to the class of strict 
multithreaded computations. Intuitively, a strict computation is one in which no subroutine 
is called until all its parameters are available. We show that for any strict multithreaded 
computation and any number $P$ of processors, there exists an execution schedule $\mathcal{X}$
that achieves time $T_P(\mathcal{X}) \le T_1/P + T_\infty$, where $T_\infty$ is a lower bound on
execution time even for arbitrarily large numbers of processors, and space
$S_P(\mathcal{X}) \le S_1 P$. Such a schedule exhibits linear expansion of space and linear
speedup, $T_P(\mathcal{X}) = O(T_1/P)$, provided the average available parallelism, which we
define as $T_1/T_\infty$, is at least proportional to $P$, that is, $T_1/T_\infty = \Omega(P)$.
We demonstrate
such schedules by exhibiting a simple centralized algorithm to compute them. We give a 
second, somewhat more efficient, algorithm that computes equally good execution schedules; 
this algorithm is online and should be practical for moderate numbers of processors, but its use 
of a centralized queue makes it inefficient for large numbers of processors. 

To demonstrate an algorithm that is efficient even for large machines, we give a random- 
ized, distributed, and online scheduling algorithm that achieves space expansion proportional 
to $P \lg P$ for any strict computation and linear expected speedup for any strict computation
with average available parallelism at least proportional to $P \lg P$, that is,
$T_1/T_\infty = \Omega(P \lg P)$. This algorithm is efficient in that on a PRAM or various
low-latency, high-bandwidth fixed-connection networks, the overhead in computing the schedule
is only a constant fraction of the
execution time. 

We also show that some nonstrictness can be allowed in an otherwise strict computation in 
a way that may improve performance, but does not adversely affect the time and space bounds. 

The remainder of this thesis is organized as follows. Chapter 2 develops a formal model 
of multithreaded computation and execution schedules. In Chapter 3 we characterize mul- 
tithreaded computations with three parameters and prove some basic bounds relating these 
parameters to execution time and space. The lower bound for general multithreaded compu- 
tations is presented in Chapter 4, and the upper bound for strict computations is presented 
in Chapter 5. Chapter 6 presents and analyzes a distributed scheduling algorithm for strict 
computations. In Chapter 7 we present a technique to allow nonstrictness without degrading 
the space and time bounds obtainable by a strict execution. Finally, in Chapter 8 we discuss 
some related work, and in Chapter 9 we conclude with some perspective on our results and
some open problems.


Chapter 2 


A model for multithreaded computation 


This chapter defines the model of multithreaded computation that we use in this thesis. We 
also define what it means for a parallel computer to execute a multithreaded computation. 

A multithreaded computation is composed of a set of threads, each of which is a sequential 
ordering of unit-size tasks. In Figure 2.1, for example, each shaded block is a thread with circles 
representing tasks and the horizontal edges, called continue edges, representing the sequential 
ordering. The tasks of a thread must execute in this sequential order from the first (leftmost)
task to the last (rightmost) task. In order to execute a thread, we allocate for it a chunk of 
memory, called an activation frame, that the tasks of the thread can use to store the values on 
which they compute. 

An execution schedule for a multithreaded computation determines which processors of a 
parallel computer execute which tasks at each step. An execution schedule depends on the 
particular multithreaded computation and the number of processors in the parallel computer. 
In any given step of an execution schedule, each processor either executes a single task or sits 
idle. 

During the course of its execution, a thread may create, or spawn, other threads. Spawning 
a thread is like a subroutine call, except that the calling routine can operate concurrently with 
the called routine. We consider spawned threads to be children of the thread that did the 
spawning. In this way, threads are organized into a tree hierarchy as indicated in Figure 2.1 by 
the shaded edges, called spawn edges. Each spawn edge goes from a specific task, the task that 
actually does the spawn operation, in the parent thread to the first task of the child thread.
When a thread executes its last task, it terminates.

Figure 2.1: An example multithreaded computation. The tasks are partitioned into threads,
represented by the shaded regions, and the tasks in each thread are compiled into a sequential 
order, represented by the continue edges shown horizontal in each thread. A task can spawn 
a thread, as shown by the shaded spawn edges, and this spawning organizes the threads into 
a tree hierarchy. The data dependency edges, shown by the curved edges, impose additional 
ordering constraints as required by producer/consumer relationships. 


For an execution schedule to be valid, the task execution order must obey the constraints 
given by the edges of the computation. For example, before a task can execute, its predecessor 
— which connects to it via either a continue or spawn edge — must first execute. 

There is one more kind of dependency that a valid execution schedule must respect. Consider 
a task that produces a data value that is consumed by another task. Such a producer/consumer 
relationship precludes the consuming task from executing until after the producing task. In 
order to enforce such orderings, we introduce data dependency edges as shown in Figure 2.1 by 
the curved edges. If the execution of a thread arrives at a consuming task before the producing 
task has executed, execution of the consuming thread cannot continue — the thread stalls. 
Once the producing task executes, the data dependency is resolved, and the consuming thread 
can proceed with its execution — the thread becomes ready. 
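
The validity conditions described above can be sketched in code. This is a minimal sketch under assumed conventions that are not part of the thesis: every edge (continue, spawn, or data dependency) is a pair `(u, v)` meaning that `u` must execute before `v`, a schedule is a list of steps, and each step is a list of tasks. The function name `is_valid_schedule` and the four-task example are hypothetical.

```python
def is_valid_schedule(tasks, edges, schedule, num_processors):
    """Check a schedule against the ordering constraints of a computation.

    tasks: iterable of task names.
    edges: iterable of (u, v) pairs; u must execute before v. Continue,
           spawn, and data dependency edges are all treated alike here.
    schedule: list of steps; each step lists the tasks executed together.
    """
    step_of = {}
    for step, executed in enumerate(schedule, start=1):
        if len(executed) > num_processors:
            return False  # each processor executes at most one task per step
        for task in executed:
            if task in step_of:
                return False  # a task executes exactly once
            step_of[task] = step
    if set(step_of) != set(tasks):
        return False  # every task must execute, and no unknown task may appear
    # A task may run only in a step strictly after all of its predecessors.
    return all(step_of[u] < step_of[v] for u, v in edges)


# Hypothetical example: thread 1 is a -> b -> d, task a spawns thread 2
# (the single task c), and d consumes a value produced by c.
tasks = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "d"), ("a", "c"), ("c", "d")]

ok = is_valid_schedule(tasks, edges, [["a"], ["b", "c"], ["d"]], 2)
bad = is_valid_schedule(tasks, edges, [["a"], ["b"], ["d"], ["c"]], 2)
```

In the second schedule the consuming task `d` runs before the producing task `c`, so the check fails; in terms of the model, the thread containing `d` would have stalled at that point.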

We quantify the space used in executing a multithreaded computation in terms of activation
frames. When a task spawns a thread, it allocates an activation frame for use by the newly
spawned thread. Once a thread has been spawned and its frame has been allocated, we say
the thread is active. Recall that at any time, an active thread can be either stalled or ready, 
but even if it stalls, its activation frame remains allocated. The thread remains active until it 
terminates; at that point its frame can be deallocated. 

We make the simplifying assumption that a parent thread remains active until all its children 
terminate, and thus, a thread does not deallocate its activation frame until all its children’s 
frames have been deallocated. Although this assumption is not strictly necessary, it gives the 
execution a natural structure, and it will simplify our analyses of space utilization. We also 
assume that the frames hold all the values used by the computation; there is no global storage 
available to the computation outside the frames. Therefore, the space used at a given time in 
executing a computation is the total size of all frames used by all active threads at that time, 
and the total space used in executing a computation is the maximum such value over the course 
of the execution. 

It is important to note here the difference between what we are calling a multithreaded 
computation and a program. A program may have conditionals, and therefore, the order of 
instructions (or even the set of instructions) executed in a thread may not be known until 
the thread is actually executed. Thus, what we are calling a thread actually represents a 
particular execution of a program thread. In general, a multithreaded computation is not a 
statically determined object, rather the computation unfolds dynamically during execution as 
determined by the program and the input data. We can think of a multithreaded computation 
as encapsulating both the program and the input data. The computation then reveals itself
dynamically during execution.


An example 


The multithreaded computation shown in Figure 2.2 contains 21 tasks,
$v_1, v_2, \ldots, v_{21}$, and 5 threads, $\Gamma_1, \Gamma_2, \ldots, \Gamma_5$. Execution
begins with the root thread $\Gamma_1$ active and ready. Thread $\Gamma_1$ has activation frame
size $F(\Gamma_1) = 3$, so the execution begins with 3 units of space in use. At the first step
of the execution, a processor executes task $v_1$. At the end of the first step, $\Gamma_1$ is
still the only active (and ready) thread, and therefore, at the second step, a processor
executes task $v_2$. Task $v_2$ spawns a child thread $\Gamma_2$ with activation frame size
$F(\Gamma_2) = 6$. Consequently, the second step ends with $3 + 6 = 9$ units of space in use and
both $\Gamma_1$ and $\Gamma_2$ active and ready. Then if the parallel machine executing this
computation has at least two processors, task $v_6$ from $\Gamma_1$ and task $v_3$ from
$\Gamma_2$ can execute concurrently during the third step. Executing task $v_6$ spawns another
thread, which further increases the amount of space in use. Eventually, when task $v_5$
executes, thread $\Gamma_2$ terminates and decreases the amount of space in use. Furthermore,
executing $v_5$ resolves the data dependency $(v_5, v_{20})$. When the execution of thread
$\Gamma_1$ reaches $v_{20}$, the thread stalls until both data dependencies $(v_5, v_{20})$ and
$(v_{19}, v_{20})$ resolve.

Figure 2.2: A multithreaded computation. This computation has 21 tasks,
$v_1, v_2, \ldots, v_{21}$, and 5 threads, $\Gamma_1, \Gamma_2, \ldots, \Gamma_5$, with
activation frame sizes $F(\Gamma_1) = 3$, $F(\Gamma_2) = 6$, $F(\Gamma_3) = 3$,
$F(\Gamma_4) = 2$, and $F(\Gamma_5) = 7$.

Figures 2.3 and 2.4 show two different 2-processor execution schedules for the computation 
of Figure 2.2. The schedule of Figure 2.3 takes 14 time steps and 13 units of space. The 
schedule of Figure 2.4 takes 15 time steps and 21 units of space; for a period of time during the
execution of this schedule, every thread in the computation is active.

Time   Tasks executed   Active threads               Space in use
0                       (Γ_1)                        3
1      v_1              (Γ_1)                        3
2      v_2              (Γ_1) (Γ_2)                  9
3      v_3  v_6         (Γ_1) (Γ_2) (Γ_3)            12
4      v_4  v_7         (Γ_1) (Γ_2) (Γ_3)            12
5      v_5  v_8         (Γ_1) (Γ_3) (Γ_4)            8
6      v_9  v_11        (Γ_1) Γ_3 (Γ_4)              8
7      v_10 v_14        (Γ_1) (Γ_3)                  6
8      v_12 v_15        Γ_1 (Γ_3) (Γ_5)              13
9      v_13 v_16        Γ_1 (Γ_5)                    10
10     v_17             Γ_1 (Γ_5)                    10
11     v_18             Γ_1 (Γ_5)                    10
12     v_19             (Γ_1)                        3
13     v_20             (Γ_1)                        3
14     v_21

Figure 2.3: An execution schedule for the computation illustrated in Figure 2.2 with two
processors. Each row represents one time step of the computation as indicated in the first
column. The second column lists the tasks that execute at the associated time step. The third
column lists the threads that are active at the end of the associated time step; threads that
are also ready are shown in parentheses. The last column shows how much space is in use at the
end of the associated time step. This execution takes 14 time steps and 13 units of space.
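
The space column of Figure 2.3 can be recomputed from the frame sizes and the periods during which each thread is active. The sketch below is illustrative only: the frame sizes are those given for Figure 2.2, while the active intervals (first and last step at whose end a thread is active) are our reading of Figure 2.3 and should be treated as an assumption. The names `FRAME_SIZE`, `ACTIVE_INTERVAL`, and `space_in_use` are hypothetical.

```python
# Activation frame sizes F(Gamma_i) for the five threads of Figure 2.2.
FRAME_SIZE = {1: 3, 2: 6, 3: 3, 4: 2, 5: 7}

# Assumed reading of Figure 2.3: thread -> (first, last) time step at whose
# end the thread is active.
ACTIVE_INTERVAL = {1: (0, 13), 2: (2, 4), 3: (3, 8), 4: (5, 6), 5: (8, 11)}


def space_in_use(step):
    """Total size of the frames of all threads active at the end of a step."""
    return sum(
        FRAME_SIZE[thread]
        for thread, (first, last) in ACTIVE_INTERVAL.items()
        if first <= step <= last
    )


# Space at the end of each time step 0..14, and the space used by the
# schedule, which is the maximum over all steps.
profile = [space_in_use(step) for step in range(15)]
peak = max(profile)
```

Under these assumptions the peak is reached at the end of step 8, when the threads numbered 1, 3, and 5 are active together.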

Time   Tasks executed   Active threads                     Space in use
0                       (Γ_1)                              3
1      v_1              (Γ_1)                              3
2      v_2              (Γ_1) (Γ_2)                        9
3      v_3  v_6         (Γ_1) (Γ_2) (Γ_3)                  12
4      v_4  v_14        (Γ_1) (Γ_2) (Γ_3)                  12
5      v_7  v_15        Γ_1 (Γ_2) (Γ_3) (Γ_5)              19
6      v_8  v_16        Γ_1 (Γ_2) (Γ_3) (Γ_4) (Γ_5)        21
7      v_11 v_17        Γ_1 (Γ_2) Γ_3 (Γ_4) Γ_5            21
8      v_5  v_9         Γ_1 Γ_3 (Γ_4) Γ_5                  15
9      v_10             Γ_1 (Γ_3) Γ_5                      13
10     v_12             Γ_1 (Γ_3) Γ_5                      13
11     v_13             Γ_1 (Γ_5)                          10
12     v_18             Γ_1 (Γ_5)                          10
13     v_19             (Γ_1)                              3
14     v_20             (Γ_1)                              3
15     v_21


Figure 2.4: Another execution schedule for the computation illustrated in Figure 2.2 with two 
processors. This execution takes 15 time steps and 21 units of space. 


Chapter 3 


Time and space 


We shall characterize the time and space of an execution of a multithreaded computation in 
terms of three fundamental parameters: work, computation depth, and activation depth. We 
first introduce work and computation depth, which relate to the execution time, and then we 
focus on activation depth, which relates to the storage requirements. 

The two time parameters are based on the underlying graph structure of the multithreaded 
computation. If we ignore the shading in Figure 2.1 that organizes tasks into threads, our 
multithreaded computation is just a directed, acyclic graph, or dag. We define the work of the 
computation to be the total number of tasks and the computation depth to be the length of 
a longest directed path in the dag. For example, the computation of Figure 2.1 has work 17 
and computation depth 10, and the computation of Figure 2.2 has work 21 and computation 
depth 13. 
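
Both time parameters can be computed directly from the dag. The following is a minimal sketch, assuming the dag is given as a list of hashable task names and a list of `(u, v)` edges, and assuming that the length of a path is counted in tasks (so a single task has depth 1). The function name `work_and_depth` and the example dag are hypothetical.

```python
from functools import lru_cache


def work_and_depth(tasks, edges):
    """Return (work, computation depth) of a dag.

    Work is the number of tasks. Depth is the number of tasks on a longest
    directed path; counting tasks rather than edges is an assumption made
    for this sketch.
    """
    successors = {task: [] for task in tasks}
    for u, v in edges:
        successors[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(task):
        # Number of tasks on a longest directed path that starts at `task`.
        return 1 + max((longest_from(s) for s in successors[task]), default=0)

    work = len(successors)
    depth = max(longest_from(task) for task in successors)
    return work, depth


# Hypothetical example: a -> b -> d and a -> c -> d.
example_tasks = ["a", "b", "c", "d"]
example_edges = [("a", "b"), ("b", "d"), ("a", "c"), ("c", "d")]
example_work, example_depth = work_and_depth(example_tasks, example_edges)
```

The recursion is adequate for small examples; a large dag would call for an iterative traversal in topological order.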

We quantify and bound the execution time of a computation on a P-processor parallel 
computer in terms of the computation’s work and depth. For a given computation, let
$T_P(\mathcal{X})$ denote the time to execute the computation with $P$ processors using
execution schedule $\mathcal{X}$, and let

$$T_P = \min_{\mathcal{X}} T_P(\mathcal{X})$$

denote the minimum execution time with $P$ processors, the minimum being taken over all valid
execution schedules for the computation. Then $T_1$ is the work of the computation, since a
1-processor computer can only execute one task at each step, and $T_\infty$ is the computation
depth, since even with arbitrarily many processors, each task on a path must execute serially.

Still viewing the computation as a dag, we borrow some basic results on dag scheduling to
bound $T_P$. A computer with $P$ processors can execute at most $P$ tasks per step, and since
the computation has $T_1$ tasks, $T_P \ge T_1/P$. And, of course, we also have
$T_P \ge T_\infty$. Brent’s Theorem [5, Lemma 2] yields the bound $T_P \le T_1/P + T_\infty$.
The following theorem extends Brent’s Theorem minimally to show that this upper bound on $T_P$
can be obtained by greedy schedules: those in which at each step of the execution, if at least
$P$ tasks are ready, then $P$ tasks execute, and if fewer than $P$ tasks are ready, then all
execute. Both of the schedules shown in Figures 2.3 and 2.4 are greedy.


Theorem 1 For any multithreaded computation with work $T_1$ and computation depth $T_\infty$,
for any number $P$ of processors, any greedy execution schedule $\mathcal{X}$ achieves
$T_P(\mathcal{X}) \le T_1/P + T_\infty$.


Proof: Let $G = (V, E)$ denote the underlying dag of the computation. Thus, we have
$|V| = T_1$, and a longest directed path in $G$ has length $T_\infty$. Consider a greedy
execution schedule $\mathcal{X}$ where the set of tasks executed at time $i$, for
$i = 1, 2, \ldots, k$, is denoted $\mathcal{E}_i$, with $k = T_P(\mathcal{X})$. The
$\mathcal{E}_i$ form a partition of $V$.

We shall consider the progression $\langle G_0, G_1, G_2, \ldots, G_k \rangle$ of dags, where
$G_0 = G$, and for $i = 1, 2, \ldots, k$, we have $V_i = V_{i-1} - \mathcal{E}_i$ and $G_i$ is
the subgraph of $G_{i-1}$ induced by $V_i$. In other words, $G_i$ is obtained from $G_{i-1}$ by
removing from $G_{i-1}$ all the tasks that are executed by $\mathcal{X}$ at step $i$ and all
edges incident on these tasks. We shall show that each step of the execution either decreases
the size of the dag or decreases the length of the longest path in the dag.

We account for each step $i$ according to $|\mathcal{E}_i|$. Consider a step $i$ with
$|\mathcal{E}_i| = P$. In this case, $|V_i| = |V_{i-1}| - P$, so since $|V| = T_1$, there can be
at most $\lfloor T_1/P \rfloor$ such steps. Now consider a step $i$ with $|\mathcal{E}_i| < P$.
In this case, since $\mathcal{X}$ is greedy, $\mathcal{E}_i$ must contain every vertex of
$G_{i-1}$ with in-degree 0. Therefore, the length of a longest path in $G_i$ is one less than
the length of a longest path in $G_{i-1}$. Since the length of a longest path in $G$ is
$T_\infty$, there can be no more than $T_\infty$ steps $i$ with $|\mathcal{E}_i| < P$.

Consequently, the time it takes schedule $\mathcal{X}$ to execute the computation is
$T_P(\mathcal{X}) \le \lfloor T_1/P \rfloor + T_\infty \le T_1/P + T_\infty$. □
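
The greedy rule and the bound of Theorem 1 can be checked on a small example. The sketch below is illustrative and is not the scheduling algorithm developed later in the thesis: it ignores threads and space entirely and simply executes up to $P$ ready tasks of a dag per step. The function name `greedy_schedule` and the example of three independent chains are hypothetical.

```python
def greedy_schedule(tasks, edges, num_processors):
    """Return a greedy schedule of a dag as a list of steps (lists of tasks)."""
    successors = {task: [] for task in tasks}
    indegree = {task: 0 for task in tasks}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    ready = [task for task in tasks if indegree[task] == 0]
    schedule = []
    while ready:
        # Greedy rule: execute P ready tasks if at least P are ready,
        # otherwise execute all of them.
        step, ready = ready[:num_processors], ready[num_processors:]
        for u in step:
            for v in successors[u]:
                indegree[v] -= 1
                if indegree[v] == 0:
                    ready.append(v)  # becomes ready for the next step
        schedule.append(step)
    return schedule


# Hypothetical example: three independent chains of four tasks each.
tasks = [(chain, i) for chain in range(3) for i in range(4)]
edges = [((chain, i), (chain, i + 1)) for chain in range(3) for i in range(3)]

num_processors = 2
schedule = greedy_schedule(tasks, edges, num_processors)

t_1 = len(tasks)
# With at least as many processors as tasks, every ready task runs at once,
# so the number of steps equals the computation depth (counted in tasks).
t_inf = len(greedy_schedule(tasks, edges, len(tasks)))
within_bound = len(schedule) <= t_1 / num_processors + t_inf
```

Any other tie-breaking rule among ready tasks would still give a greedy schedule; the bound of Theorem 1 does not depend on which ready tasks are chosen.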

Theorem 1 can be interpreted in two important ways. First, the time bound given by
the theorem says that any greedy schedule yields an execution time that is within a factor
of 2 of an optimal schedule, which follows because
$T_1/P + T_\infty \le 2 \max\{T_1/P, T_\infty\}$ and $T_P \ge \max\{T_1/P, T_\infty\}$. Second,
Theorem 1 tells us when we can obtain linear parallel speedup, that is, when we can find an
execution schedule $\mathcal{X}$ such that $T_P(\mathcal{X}) = O(T_1/P)$. Specifically, when
the number $P$ of processors is no more than the average available parallelism
$T_1/T_\infty$, then $T_1/P \ge T_\infty$, which implies that for a greedy schedule
$\mathcal{X}$, we have $T_P(\mathcal{X}) \le 2T_1/P$. We shall be especially interested in the
regime where $P = O(T_1/T_\infty)$ and linear speedup is possible, since outside this regime,
linear speedup is impossible to achieve because $T_P \ge T_\infty$.

These results on dag scheduling have been known for many years. A multithreaded compu- 
tation, however, adds further structure to the dag: the partitioning of tasks into threads. This 
additional structure allows us to quantify the space used in executing a multithreaded com- 
putation. Once we have quantified space usage, we will look back at Theorem 1 and consider 
whether there exist execution schedules that achieve similar time bounds while also making 
efficient use of space. Of course, we will have to quantify a space bound to capture what we 
mean by efficient use of space. 

We shall focus on a space parameter for a multithreaded computation which is based on 
the tree structure of threads. If we collapse each thread into a single node and consider just 
the spawn edges, the multithreaded computation becomes a rooted tree with the spawn edges 
as child pointers. We call this tree the activation tree. We define the activation depth of a 
thread to be the sum of the sizes of the activation frames of all its ancestors, including itself. 
The activation depth of a multithreaded computation is the maximum activation depth of any 
thread. For example, in the computation of Figure 2.2, thread $\Gamma_4$ has activation depth
8, and the computation has activation depth 10, since the deepest thread $\Gamma_5$ has
activation depth 10.
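
The definition of activation depth translates into a short walk up the activation tree. The sketch below uses the frame sizes given for Figure 2.2; the parent pointers are our reading of that figure (threads 2, 3, and 5 spawned by the root, thread 4 spawned by thread 3) and are an assumption. The names `FRAME_SIZE`, `PARENT`, and `activation_depth` are hypothetical.

```python
# Frame sizes F(Gamma_i) from Figure 2.2, keyed by thread number.
FRAME_SIZE = {1: 3, 2: 6, 3: 3, 4: 2, 5: 7}

# Assumed activation tree: thread -> parent thread (None for the root).
PARENT = {1: None, 2: 1, 3: 1, 4: 3, 5: 1}


def activation_depth(thread):
    """Sum of the frame sizes of a thread and all of its ancestors."""
    depth = 0
    while thread is not None:
        depth += FRAME_SIZE[thread]
        thread = PARENT[thread]
    return depth


depths = {thread: activation_depth(thread) for thread in FRAME_SIZE}
# The activation depth of the computation is the maximum over all threads.
computation_activation_depth = max(depths.values())
```

With this tree, thread 4 has depth 3 + 3 + 2 = 8 and thread 5 has depth 3 + 7 = 10, which agrees with the values quoted in the text.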

We shall have occasion to consider subcomputations and subcomputation activation depth. 
A subcomputation is the portion of a computation rooted at a given thread in the activation
tree, and the activation depth of a subcomputation is the activation depth of the
subcomputation when considered in isolation as a multithreaded computation. For example, in
the computation of Figure 2.2, the subcomputation rooted at thread $\Gamma_3$ consists of 7
tasks, $v_7, v_8, \ldots, v_{13}$, and 2 threads, $\Gamma_3$ and $\Gamma_4$, and has activation
depth $3 + 2 = 5$.

We shall denote the space required by a $P$-processor execution schedule $\mathcal{X}$ of a
multithreaded computation by $S_P(\mathcal{X})$. Recall that $S_P(\mathcal{X})$ is just the
maximum, over all steps in $\mathcal{X}$, of the sum of the sizes of the activation frames of
the active threads at that step. Since we can always simulate a $P$-processor execution with a
1-processor execution that uses no more space, we have
$S_1(\mathcal{X}) \le S_P(\mathcal{X})$. The minimum space used by any execution with any
number of processors is therefore $S_1 = \min_{\mathcal{X}} S_1(\mathcal{X})$.

The following simple theorem shows that the activation depth of a computation is a lower
bound on the space required to execute it.


Theorem 2 Let $A$ be the activation depth of a multithreaded computation, and let
$\mathcal{X}$ be a $P$-processor execution schedule of the computation. Then
$S_P(\mathcal{X}) \ge A$, and hence, $S_1 \ge A$.


Proof: In any schedule, the leaf thread with greatest activation depth must be active at some
time step. Since we assume that if a thread is active, its parent is active, when the deepest leaf
thread is active, all its ancestors are active, and hence, all its ancestors’ frames are allocated.
But, the sum of the sizes of its ancestors’ activation frames is just the activation depth. Since
$S_P(\mathcal{X}) \ge A$ holds for all $\mathcal{X}$ and all $P$, it holds for the minimum-space
execution schedule, and hence, $S_1 \ge A$. □


Given the lower bound of activation depth on the space used by a P-processor schedule, it is 
natural to ask whether the activation depth can be achieved as an upper bound. In general, the 
answer is no, since all the threads in a computation may contain a cycle of data dependencies 
that force all of them to be simultaneously active in any execution schedule. For the class of 
depth-first computations, however, space equal to the activation depth can be achieved by a 
1-processor schedule. 

A depth-first computation is a multithreaded computation in which a left-to-right depth-first 
search of tasks in the activation tree always visits all the tasks on which a given task depends
before it visits the given task. In fact, this depth-first search produces a 1-processor execution
schedule which is just the familiar stack-based execution: The serial depth-first execution begins
with the root thread and executes its tasks until it either spawns a child thread or terminates.
If the thread spawns a child, the parent thread is put aside to be resumed only after the child
thread terminates; the scheduler then begins work on the child, executing the child until it
either spawns or terminates. For the computation of Figure 2.2, the 1-processor execution
schedule that executes tasks in the order $v_1, v_2, v_3, \ldots, v_{20}, v_{21}$ is the serial
depth-first schedule.

Theorem 3 For any depth-first computation, $S_1 = A$.


Proof: At any time in a serial depth-first execution of the computation, the set of active
threads always forms a path from the root. Therefore, the space required is just the activation
depth of the computation. By Theorem 2, $S_1 \ge A$, and thus the space used is the minimum
possible. □
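
The stack-based serial execution can be simulated on the activation tree alone, since only the order in which frames are allocated and released matters for space. The following is a minimal sketch under the assumption that each child runs to completion before its parent resumes; the tree and frame sizes are our reading of Figure 2.2, and the function name `serial_depth_first_peak` is hypothetical.

```python
def serial_depth_first_peak(children, frame_size, root):
    """Peak space of a stack-based serial execution of an activation tree.

    children: thread -> list of child threads in spawn order.
    frame_size: thread -> size of its activation frame.
    """
    peak = 0
    in_use = 0

    def run(thread):
        nonlocal peak, in_use
        in_use += frame_size[thread]  # allocate the frame when spawned
        peak = max(peak, in_use)
        for child in children.get(thread, []):
            run(child)  # the parent is set aside until the child terminates
        in_use -= frame_size[thread]  # deallocate on termination

    run(root)
    return peak


# Assumed activation tree and frame sizes of Figure 2.2.
children = {1: [2, 3, 5], 3: [4]}
frame_size = {1: 3, 2: 6, 3: 3, 4: 2, 5: 7}
peak = serial_depth_first_peak(children, frame_size, root=1)
```

At every moment the allocated frames lie on a single root-to-thread path, so the peak equals the largest such path sum, which is the activation depth.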


We now turn our attention to determining how much space $S_P(\mathcal{X})$ a $P$-processor
execution schedule $\mathcal{X}$ can use and still be considered efficient with respect to space
usage. Our strategy
is to compare the space used by a P-processor schedule with the space required by an optimal 
1-processor schedule. Of course, we can always ignore P — 1 of the processors and obtain the 
same space bounds, and therefore, our goal is to use small space while obtaining linear speedup. 

Even for depth-first computations, a P-processor schedule may use nearly P times the space 
of a 1-processor schedule. We exhibit a depth-first computation with activation depth
$A = S_1$ that, for any number $P \le T_1/T_\infty$ of processors, requires space nearly
$S_1 P$ in order to achieve linear parallel speedup. In the computation, the root thread, which
we refer to as the loop, spawns many children, and each child thread is the root of a large
subcomputation, which we refer to as an iteration. The root thread has an activation frame of
size 1, and each iteration has activation depth $S_1 - 1$. See Figure 3.1. In addition, data
dependencies force a serial ordering on the tasks within each iteration, but there are no data
dependencies between tasks in different iterations. In other words, the computation has no
available parallelism within an iteration; parallelism can only be realized by the concurrent
execution of multiple iterations. Executing $P$ iterations concurrently uses space
$P(S_1 - 1) + 1$, which is nearly $S_1 P$.

Figure 3.1: The activation tree of a multithreaded computation for which any execution schedule
$\mathcal{X}$ requires space $S_P(\mathcal{X}) = \Omega(S_1 P)$ in order to achieve linear
speedup. The root thread is a loop and each child thread is the root of a subcomputation that
forms an iteration. The data dependencies in each iteration (not shown) link the tasks of the
iteration into a sequential order, so there is no parallelism within the iteration. Between
iterations, however, there are no data dependencies, so multiple iterations can be executed
concurrently. The average available parallelism $T_1/T_\infty$ equals the number of iterations.
Therefore, for any number $P \le T_1/T_\infty$ of processors, there is an execution schedule
$\mathcal{X}$ (any greedy schedule, for example) that achieves
$T_P(\mathcal{X}) = O(T_1/P)$ and space
$S_P(\mathcal{X}) = P(S_1 - 1) + 1 = \Theta(S_1 P)$.


Thus, for any number $P \le T_1/T_\infty$ of processors, this computation has an execution
schedule $\mathcal{X}$ (any greedy schedule, for example) that achieves linear speedup,
$T_P(\mathcal{X}) = O(T_1/P)$, at the cost of space $S_P(\mathcal{X}) = O(S_1 P)$.
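
The space cost of the loop computation is easy to tabulate. This is a small sketch under the stated construction (a root frame of size 1 and iterations of activation depth $S_1 - 1$); the function name `loop_space` and the sample values are hypothetical.

```python
def loop_space(serial_space, concurrent_iterations):
    """Space used when the loop runs several iterations at once.

    The root (loop) thread holds a frame of size 1, and each concurrently
    executing iteration can require up to (serial_space - 1) units.
    """
    return concurrent_iterations * (serial_space - 1) + 1


# Hypothetical numbers: serial space S_1 = 100 and P = 8 processors.
serial_space, processors = 100, 8
space = loop_space(serial_space, processors)   # 8 * 99 + 1
linear_expansion = serial_space * processors   # S_1 * P
```

For these values the schedule uses 793 units against $S_1 P = 800$, and with a single iteration the expression reduces to $S_1$.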

In fact, a P-processor schedule that uses only P times the space of a single processor is 
arguably efficient, since on average, each of the P processors only needs as much memory as is 
used by the 1 processor. We would, of course, like to do better, but an expansion in space that 
is linear in the number of processors, while achieving linear speedup, is quite good, since the
time-space product is then at most a constant factor larger than that of the serial execution:

$$T_P(\mathcal{X}) \, S_P(\mathcal{X}) = O(T_1 S_1) .$$
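
The bound on the time-space product follows by multiplying the two target bounds. Written out, assuming a schedule $\mathcal{X}$ that achieves both linear speedup and linear expansion of space:

```latex
T_P(\mathcal{X}) \, S_P(\mathcal{X})
  = O\!\left(\frac{T_1}{P}\right) \cdot O(S_1 P)
  = O(T_1 S_1)
```

The factor of $P$ gained in time is exactly the factor of $P$ paid in space, so the product does not grow with the number of processors.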


We shall show in Chapter 4 that achieving linear speedup and linear expansion of space
simultaneously is impossible in general, even for depth-first computations. For a restricted
class of computations that we call strict, however, Chapter 5 shows that one can achieve both.


To summarize, we can parameterize a multithreaded computation with three measures:

• $T_1$ denotes the work of the computation,

• $T_\infty$ denotes its computation depth,

• $A$ denotes its activation depth.

For depth-first computations, $S_1 = A$. For any number $P = O(T_1/T_\infty)$ of processors, we
would like to find an execution schedule $\mathcal{X}$ with the following time and space
bounds:

• $T_P(\mathcal{X}) = O(T_1/P)$,

• $S_P(\mathcal{X}) = O(S_1 P)$.

Chapter 4


Lower bound 


In this chapter we show that there exist multithreaded computations for which no execution 
schedule can achieve both linear speedup and linear expansion of space. In particular, for any 
amount of serial space $S$ and any (reasonably large) serial execution time $T$, we can exhibit
a depth-first multithreaded computation with work $T_1 = T$ and activation depth $A = S$ but
with provably bad time/space tradeoff characteristics. Because the computation is depth-first,
we know from Theorem 3 that it can be executed using serial space $S_1 = A$. Furthermore, we
know from Theorem 1 that for any number $P$ of processors, any greedy $P$-processor execution
schedule $\mathcal{X}$ achieves $T_P(\mathcal{X}) \le T_1/P + T_\infty$. Our computation has
computation depth $T_\infty$ approximately $\sqrt{T_1}$, and consequently, for
$P = O(\sqrt{T_1})$, a greedy schedule $\mathcal{X}$ yields linear speedup,
$T_P(\mathcal{X}) = O(T_1/P)$. We show, however, that for this computation, any schedule
achieving $T_P(\mathcal{X}) = O(T_1/P)$ must use space
$S_P(\mathcal{X}) = \Omega(\sqrt{T_1}(P - 1))$. Of course, $\sqrt{T_1}$ may be much larger than
$S_1$; hence, this space bound is nowhere near linear in its space expansion.

We construct a multithreaded computation having this poor time/space performance by
placing tasks that are computationally deep into the same portion of the computation as tasks
that are computationally shallow. If we look at just the dag structure of the computation, it
appears, from a distance, as shown in Figure 4.1; the dag in Figure 4.1 is just missing a few of
the tasks and edges that organize the computation into a tree hierarchy. The dag consists of $m$
(a value we will specify later) components $C_0, C_1, \ldots, C_{m-1}$ that we call jobs. From this dag, we
see that with any number $P \le m$ of processors, we can obtain linear speedup by simultaneously
executing $P$ jobs. Doing so, however, uses a great deal of memory. To execute a job $C_i$, we begin
with a group of computationally shallow tasks called headers (see Figure 4.1). Each header is
part of a separate subcomputation with fairly large activation depth, so to execute a header
task we must begin execution of its associated subcomputation by allocating the necessary
activation frames. Each of these subcomputations also contains a computationally deep task,
called a blocker (see Figure 4.1), from the previous job $C_{i-1}$. Therefore, these subcomputations
cannot complete, and the associated memory cannot be deallocated, until the blockers from the
previous job execute. But in order to achieve speedup, jobs must execute concurrently, and
consequently, the headers must execute early and the blockers must execute late. Therefore, in
this scenario, many subcomputations begin early but cannot finish until late, hence the heavy
demands on storage.

Figure 4.1: The tasks in the leaf threads are organized into $m$ jobs, $C_0, C_1, \ldots, C_{m-1}$. The black
header tasks have shallow computation depth. The white tasks form the trunk of the job. The
grey blocker tasks have deep computation depth.


Theorem 4 For any amount of serial space $S \ge 4$ and serial time $T \ge 16S^2$, there exists a
depth-first multithreaded computation with work $T_1 = T$, computation depth $T_\infty \le 8\sqrt{T_1}$, and
activation depth $A = S$, such that for any number $P$ of processors and any value $\rho$ in the range
$1 \le \rho \le \frac{1}{8}\sqrt{T_1}$, if $\mathcal{X}$ is a valid $P$-processor execution schedule that achieves $T_P(\mathcal{X}) \le T_1/\rho$,
then $S_P(\mathcal{X}) \ge \frac{1}{4}(\rho - 1)\sqrt{T_1} + S_1$.
Proof: To exhibit a depth-first multithreaded computation with work $T_1$, computation depth
$T_\infty$, and activation depth $A = S_1$, we first consider the dag structure of the computation. If we
look at just the tasks in the leaf threads and ignore a few of the edges, the dag appears as in
Figure 4.1. The tasks are organized into

\[ m = \sqrt{T_1}/8 \]

(nearly) separate components $C_0, C_1, \ldots, C_{m-1}$ that we call jobs.¹ Each job begins with

\[ \lambda = \sqrt{T_1}/S_1 \]

tasks that we call headers. After the headers, each job contains

\[ \nu = 6\sqrt{T_1} \]

tasks organized into a chain that we call the trunk. There are no dependencies between the
headers, but the first task of the trunk cannot execute until after all the headers. At the end
of each job, there are $\lambda$ blockers. Each job, therefore, consists of $2\lambda + \nu = 2\sqrt{T_1}/S_1 + 6\sqrt{T_1}$
tasks. Since there are $m = \sqrt{T_1}/8$ jobs, the total number of tasks accounted for by the $m$ jobs is
$(2\sqrt{T_1}/S_1 + 6\sqrt{T_1})\sqrt{T_1}/8 = \frac{3}{4}T_1 + \frac{1}{4}T_1/S_1$, and this number is no more than $\frac{13}{16}T_1$ since $S_1 \ge 4$.
The remaining (at least) $\frac{3}{16}T_1$ tasks form the parts of the computation not shown in Figure 4.1.
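
The task count above can be checked mechanically. The following is a minimal Python sketch, not part of the original proof: the function names `job_parameters` and `tasks_in_jobs` are hypothetical, and the sketch assumes $T_1$ is a perfect square so that $\sqrt{T_1}$ is an integer (the proof itself sets rounding aside).

```python
from fractions import Fraction
from math import isqrt


def job_parameters(S1, T1):
    """Return (m, lam, nu) for the construction, as exact fractions.

    Assumption of this sketch: T1 is a perfect square.
    """
    root = isqrt(T1)
    if root * root != T1:
        raise ValueError("this sketch assumes T1 is a perfect square")
    if S1 < 4 or T1 < 16 * S1 * S1:
        raise ValueError("the theorem requires S1 >= 4 and T1 >= 16*S1**2")
    m = Fraction(root, 8)      # number of jobs
    lam = Fraction(root, S1)   # headers per job (and blockers per job)
    nu = Fraction(6 * root)    # trunk tasks per job
    return m, lam, nu


def tasks_in_jobs(S1, T1):
    """Total number of tasks accounted for by the m jobs."""
    m, lam, nu = job_parameters(S1, T1)
    return m * (2 * lam + nu)


if __name__ == "__main__":
    S1, T1 = 4, 64 * 64  # hypothetical sample values
    total = tasks_in_jobs(S1, T1)
    # Matches (3/4) T1 + (1/4) T1 / S1 and stays within (13/16) T1.
    assert total == Fraction(3, 4) * T1 + Fraction(T1, 4 * S1)
    assert total <= Fraction(13, 16) * T1
    print(total, "of", T1, "tasks lie in the jobs")
```

With $S_1 = 4$ the bound $\frac{13}{16}T_1$ is met with equality; larger $S_1$ leaves more of the work outside the jobs.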

When we consider how the tasks of each job are organized into the threads of the
computation, we will exhibit an organization such that each header task is part of a separate
subcomputation with activation depth at least $S_1/2$. This organization will also be such that
each of these subcomputations contains a blocker task from a different job. In particular, each
job $C_i$, for $i = 1, \ldots, m-1$, has each of its header tasks in a subcomputation that also contains a
blocker task of the previous job $C_{i-1}$. For each such subcomputation, the blocker task is placed
to ensure that the subcomputation cannot complete until the blocker task executes. Therefore,
from the time the header task of job $C_i$ executes until the time the blocker task of job $C_{i-1}$
executes, all of the (at least) $S_1/2$ space used by the subcomputation remains active. Furthermore,
if all of the headers of $C_i$ execute before any of the blockers of $C_{i-1}$, then during the intervening
time period, $\lambda$ of these subcomputations are active, and these active subcomputations take up
at least $\frac{1}{2}S_1\lambda = \frac{1}{2}\sqrt{T_1}$ space. We will show that in fact, this space-consuming scenario must
occur in any execution schedule that achieves any amount of parallel speedup.

¹ In what follows, we refer to a number $x$ of objects (such as tasks) when $x$ may not be integral. Rounding
these quantities to integers does not affect the correctness of the proof. For ease of exposition, we shall not
consider the issue.

For any number $P$ of processors, consider any valid $P$-processor execution schedule $\mathcal{X}$. For
each job $C_i$, let $t^{(s)}_i$ denote the time step at which $\mathcal{X}$ executes the first trunk task of $C_i$, and
let $t^{(f)}_i$ denote the first time step at which $\mathcal{X}$ executes a blocker task of $C_i$. Since the trunk
has length $\nu$ and no blocker task of $C_i$ can execute until after the last trunk task of $C_i$, we
have $t^{(f)}_i - t^{(s)}_i \ge \nu$.

Now consider two jobs, $C_i$ and $C_{i-1}$, and suppose $t^{(s)}_i < t^{(f)}_{i-1}$; this is the scenario we described
as using at least $\frac{1}{2}\sqrt{T_1}$ space. In this case, we consider the time interval from $t^{(s)}_i$ (inclusive) to
$t^{(f)}_{i-1}$ (exclusive) during which we say that job $C_i$ is exposed, and we let $\tau_i = t^{(f)}_{i-1} - t^{(s)}_i$ denote the
amount of time job $C_i$ is exposed. See Figure 4.2. If $t^{(s)}_i \ge t^{(f)}_{i-1}$, then job $C_i$ is never exposed and
we let $\tau_i = 0$. As we have seen, over the time interval during which a job is exposed, it uses at
least $\frac{1}{2}\sqrt{T_1}$ space. We will show that in order to achieve speedup $\rho$ (that is, $T_P(\mathcal{X}) \le T_1/\rho$),
there must be some time step during the execution at which at least $\lceil \frac{3}{4}\rho \rceil - 1$ jobs are exposed.

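If schedule $\mathcal{X}$ is such that $T_P(\mathcal{X}) \le T_1/\rho$, then we must have $t^{(f)}_{m-1} \le T_1/\rho$, and we can
expand this inequality out as

\[
\begin{aligned}
T_1/\rho &\ge t^{(f)}_{m-1} \\
         &\ge t^{(f)}_{m-1} - t^{(s)}_0 \\
         &= \sum_{i=0}^{m-1} \left( t^{(f)}_i - t^{(s)}_i \right) - \sum_{i=1}^{m-1} \left( t^{(f)}_{i-1} - t^{(s)}_i \right) .
\end{aligned}
\tag{4.1}
\]

Figure 4.2: Scheduling the execution of the jobs. A solid vertical interval from $t^{(s)}_i$ to $t^{(f)}_i$ indicates
the time during which the trunk of job $C_i$ is being executed. When $t^{(s)}_i < t^{(f)}_{i-1}$, we can define an
interval, shown dashed, of length $\tau_i = t^{(f)}_{i-1} - t^{(s)}_i$ during which job $C_i$ is exposed.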


Considering the first sum, we recall that $t^{(f)}_i - t^{(s)}_i \ge \nu$; hence,

\[ \sum_{i=0}^{m-1} \left( t^{(f)}_i - t^{(s)}_i \right) \ge m\nu . \tag{4.2} \]

Considering the second sum of Inequality (4.1), when $t^{(f)}_{i-1} > t^{(s)}_i$ (so $C_i$ is exposed), we have
$\tau_i = t^{(f)}_{i-1} - t^{(s)}_i$, and otherwise, $\tau_i = 0 \ge t^{(f)}_{i-1} - t^{(s)}_i$. Therefore,

\[ \sum_{i=1}^{m-1} \tau_i \ge \sum_{i=1}^{m-1} \left( t^{(f)}_{i-1} - t^{(s)}_i \right) . \tag{4.3} \]

Substituting Inequality (4.2) and Inequality (4.3) back into Inequality (4.1), we obtain

\[ T_1/\rho \ge m\nu - \sum_{i=1}^{m-1} \tau_i , \]

from which

\[ \sum_{i=1}^{m-1} \tau_i \ge m\nu - T_1/\rho . \]


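Let $\mathit{exposed}(t)$ denote the number of jobs exposed at time step $t$, and observe that

\[ \sum_{t=0}^{T_1/\rho} \mathit{exposed}(t) = \sum_{i=1}^{m-1} \tau_i . \]

Then the average number of exposed jobs per time step is

\[
\begin{aligned}
\frac{1}{T_1/\rho} \sum_{t=0}^{T_1/\rho} \mathit{exposed}(t)
  &= \frac{1}{T_1/\rho} \sum_{i=1}^{m-1} \tau_i \\
  &\ge \frac{1}{T_1/\rho} \left( m\nu - T_1/\rho \right) \\
  &= \frac{m\nu}{T_1}\,\rho - 1 \\
  &= \frac{3}{4}\,\rho - 1 ,
\end{aligned}
\]

since $m = \sqrt{T_1}/8$ and $\nu = 6\sqrt{T_1}$. There must be some time step $t^*$ for which $\mathit{exposed}(t^*)$ is at
least the average, and consequently, since the number of exposed jobs is an integer,

\[ \mathit{exposed}(t^*) \ge \left\lceil \tfrac{3}{4}\rho \right\rceil - 1 . \]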


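Now recalling that each exposed job uses space at least $\frac{1}{2}\sqrt{T_1}$, we have

\[
\begin{aligned}
S_P(\mathcal{X}) &\ge \frac{1}{2} \left( \left\lceil \tfrac{3}{4}\rho \right\rceil - 1 \right) \sqrt{T_1} \\
                 &\ge \frac{1}{4} (\rho - 1) \sqrt{T_1} + S_1
\end{aligned}
\]

for $S_1 \le \sqrt{T_1}/4$ (which is true since $T_1 \ge 16 S_1^2$).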
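
To get a feel for the magnitudes, the two sides of this chain can be tabulated for sample values. The following Python sketch is illustrative only: the names `min_exposed_jobs`, `exposed_space`, and `theorem_bound` are hypothetical, $T_1$ is assumed to be a perfect square, and the loop covers integer speedups $\rho$ from 2 up to $m = \sqrt{T_1}/8$.

```python
from fractions import Fraction
from math import ceil, isqrt


def min_exposed_jobs(rho):
    """ceil(3*rho/4) - 1: jobs that must be exposed at some time step."""
    return ceil(Fraction(3, 4) * Fraction(rho)) - 1


def exposed_space(rho, T1):
    """Space held by the exposed jobs: (1/2) * sqrt(T1) per exposed job."""
    root = isqrt(T1)
    if root * root != T1:
        raise ValueError("this sketch assumes T1 is a perfect square")
    return Fraction(min_exposed_jobs(rho) * root, 2)


def theorem_bound(rho, S1, T1):
    """The stated lower bound (1/4) * (rho - 1) * sqrt(T1) + S1."""
    return Fraction(rho - 1, 4) * isqrt(T1) + S1


if __name__ == "__main__":
    S1, T1 = 4, 64 * 64           # hypothetical sample values
    m = isqrt(T1) // 8            # number of jobs, here 8
    for rho in range(2, m + 1):
        lhs, rhs = exposed_space(rho, T1), theorem_bound(rho, S1, T1)
        assert lhs >= rhs
        print(f"rho={rho}: exposed jobs hold {lhs}, stated bound {rhs}")
```

Even at $\rho = 2$ the exposed jobs hold $\frac{1}{2}\sqrt{T_1} = 32$ units of space in this example, eight times the serial space $S_1 = 4$.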
All that remains is exhibiting the organization of the tasks of each job into a depth-first
multithreaded computation with work $T_1$, computation depth $T_\infty \le 8\sqrt{T_1}$, and activation depth
$A = S_1$ in such a way that for each job, each header task is placed in a subcomputation with a
blocker task from the previous job and that each such subcomputation has activation depth at
least $S_1/2$. There are actually many ways of creating such a computation. One such way, which
uses unit-size activation frames for each thread, is shown in Figure 4.3.

Figure 4.3: Laying out the jobs into the threads of a multithreaded computation. In this
example, each activation frame has unit size, so $A = 6$. Also, in this example $\lambda = 2$, $\nu = 8$, and
only the first 2 out of the $m$ tasks in the root thread are shown. Each task of the root thread
spawns a child, and each child thread contains $\lambda + 1 = 3$ tasks; the first $\lambda$ of these spawn a
child thread which is the root of a subcomputation with activation depth $A - 2 = 4$, and the
last one spawns a leaf thread with the $\nu = 8$ trunk tasks of a single job.

For the multithreaded computation of Figure 4.3, the root thread contains $m$ tasks, each of
which spawns a child thread. Each child thread contains $\lambda + 1$ tasks; the first $\lambda$ of these spawn
a child thread which is the root of a subcomputation with activation depth $S_1 - 2 \ge S_1/2$ (since
$S_1 \ge 4$), and the last one spawns a leaf thread with the $\nu$ trunk tasks of a single job. Each
of these subcomputations contains a single header from one job and a single blocker from the
previous job (except in the case of the first group of $\lambda$) as shown in Figure 4.3. The header and
blocker in a subcomputation are organized such that in order to execute the header, all $S_1 - 2$
of the threads in the subcomputation must be made active, and none of them can terminate
until the blocker executes. We can verify from Figure 4.3 and from the given values of $m$, $\lambda$,
and $\nu$ that this construction actually has work slightly less than $T_1$; in order to make the work
equal to $T_1$ we can just add the extra tasks evenly among the threads that contain the trunk of
each job (thereby increasing $\nu$ by a bit). Also, we can verify that $T_\infty \le 8\sqrt{T_1}$. Finally, looking
at Figure 4.3 we can see that this computation is indeed depth-first. □


The construction of a multithreaded computation with provably bad time/space
characteristics as just described can be modified in various ways to accommodate various restrictions on
the model while still obtaining the same result. For example, some real multithreaded systems
require limits on the number of tasks in a thread, data dependencies that only go to the first 
task of a thread, limited fan-in for data dependencies, or a limit on the number of children a 
thread can have. Simple changes to the construction just described can produce multithreaded 
computations that accommodate any or all of these restrictions and still have the same provably 
bad time/space tradeoff. Thus, the lower bound of Theorem 4 holds even for multithreaded 
computations with any or all of these restrictions. 

Theorem 4 tells us that for any amount of serial space $S$ and any (reasonably) large serial
execution time $T$, there exists a multithreaded computation that can be executed serially in the
given amount of time and space, has sufficient average available parallelism to achieve linear
speedup over a wide range of numbers of processors, but in order to achieve any speedup at
all, requires (potentially) extreme amounts of space. For example, in order to achieve linear
speedup when the number of processors is close to the average available parallelism, such a
computation requires space proportional to $T_1$, the serial execution time. Even to achieve
speedup of 2 ($\rho = 2$), such a computation requires space proportional to $\sqrt{T_1}$: not quite $T_1$,
but still potentially huge compared to $S_1$.

There are actually many ways of stating a lower bound as in Theorem 4, but they all 
come down to the same thing: There exist multithreaded computations with arbitrary serial 
execution time and space and with arbitrarily large amounts of average available parallelism, 
such that achieving any amount of speedup ranging from 1 (no speedup) up to the average 
available parallelism requires space that ranges from the serial space up to nearly the serial 


time correspondingly. 


Chapter 5 


Scheduling algorithms for strict multithreaded 
computations 


Given a multithreaded computation, a scheduling algorithm for a P-processor parallel computer 
must compute a valid P-processor execution schedule. In computing such a schedule, the 
algorithm does not know the entire computation; the computation actually unfolds dynamically 
during the course of execution, and consequently, the scheduling algorithm must be online. At 
any given time during the execution, the scheduler has a set of active threads some of which 
are ready and some of which are stalled. There might be some extra information attached to 
each thread that the scheduling algorithm can use in deciding which ready threads get executed 
by which processors, but the scheduler cannot know about the structure of the portion of the 
computation not yet executed. 

Besides being able to compute an efficient execution schedule, we would like the scheduling 
algorithm itself to be efficient. In computing the execution schedule, the algorithm incurs costs 
that we can broadly classify into three categories: queueing costs, synchronization costs, and 
communication costs. The scheduling algorithm maintains active threads in one or more queues. 
By enqueuing and dequeuing threads over the course of execution, the scheduler incurs queueing 
costs. If the scheduling algorithm requires the use of any shared data or global values, it incurs 
synchronization costs. Suppose that at some time during the computation, the scheduling
algorithm decides that a processor $p$ should execute a task from thread $\Gamma$, and then at some
later time, the scheduler decides that a different processor $p' \ne p$ should execute a task from the
same thread $\Gamma$. In this case, some information about $\Gamma$, possibly the entire activation frame,
must be moved from processor $p$ to processor $p'$. In doing so, the scheduling algorithm incurs
some communication cost.

Figure 5.1: (a) This multithreaded computation is nonstrict since it has data dependencies, shown bold,
that go to non-ancestor threads. (b) If we replace the offending data dependencies with new ones, shown
bold, we obtain a strict computation since all data dependencies go from a child thread to an ancestor
thread.

With a $P$-processor parallel computer and a scheduling algorithm, given a depth-first
multithreaded computation with work $T_1$, computation depth $T_\infty$, and activation depth $A = S_1$
possessing average available parallelism $T_1/T_\infty = \Omega(P)$, we would like the scheduling algorithm
to compute an execution schedule $\mathcal{X}$ with $T_P(\mathcal{X}) = O(T_1/P)$ and $S_P(\mathcal{X}) = O(S_1 P)$.

In light of the lower bound of Chapter 4, we consider scheduling algorithms for a specific class of
depth-first multithreaded computations. In particular, we consider multithreaded computations in
which all data dependencies go from a child thread to an ancestor thread as illustrated in 
Figure 5.1. 

Requiring that all data dependencies go from a child thread to an ancestor thread can be 
viewed as requiring all function invocations (in a functional language) to be strict, and therefore, 
we refer to this class of computations as strict multithreaded computations. For example, many 


languages express parallelism with the future construct [12, 15, 21]. 
The expression (future X), where X is an arbitrary expression, creates a task to
evaluate X and also creates an object known as a future to eventually hold the
value of X. When created, the future is in an unresolved, or undetermined, state.
When the value of X becomes known, the future resolves to that value, effectively
mutating into the value of X and losing its identity as a future. Concurrency arises
because the expression (future X) returns the future as its value without waiting
for the future to resolve. Thus, the computation containing (future X) can proceed
concurrently with the evaluation of X. [21]


Consider the following code fragment: 


    (let ((a (future A))
          (b (future B)))
      (+ C (F a b)))

Such a code fragment could appear, for example, in a Mul-T [21] program. Figure 5.2(a)
illustrates the corresponding multithreaded computation. In this example, the thread evaluating
this code can spawn child threads to evaluate expressions A and B concurrently; to the parent
thread, identifiers a and b are futures until they resolve. Furthermore, evaluation of A and
B can proceed concurrently with the parent thread's evaluation of expression C. Once the
parent thread has evaluated C (and F) it can go ahead and spawn a child thread to evaluate
the invocation (F a b) even if the arguments have not resolved. When a function is invoked
with an argument that is a future, the invocation is called nonstrict; hence, we call the spawn
nonstrict as well. To make this computation strict, we must ensure that the function value of F
is not invoked until the arguments a and b resolve. In Mul-T, this strictness can be expressed
with the touch construct as shown in the following code fragment:


    (let ((a (future A))
          (b (future B)))
      (+ C (F (touch a) (touch b))))


In this case, before the parent thread goes to spawn the invocation (F a b), it touches the
arguments a and b, thereby forcing the thread to stall until those arguments resolve. Then
when it performs the spawn, the arguments are no longer futures, and consequently, the spawn
is strict. Figure 5.2(b) illustrates the computation corresponding to this strict version of the
code; notice that the data dependencies now conform to the strictness condition. The strict
version of this computation still has parallelism: The expressions A, B, and C can still be
evaluated concurrently; it's just that evaluation of A and B can no longer operate in parallel
with the invocation (F a b).
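
The difference between the two fragments can be imitated in other languages with futures. The sketch below is an analogy in Python using the standard `concurrent.futures` module, not Mul-T: the helper `touch` and the functions `nonstrict` and `strict` are hypothetical names introduced here, and a thread pool stands in for spawned threads.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def touch(x):
    """Return the value of x, blocking until it resolves if x is a future."""
    return x.result() if isinstance(x, Future) else x


def F(x, y):
    # F accepts unresolved arguments and touches them only when needed.
    return touch(x) * touch(y)


def nonstrict(pool, A, B, C):
    a = pool.submit(A)           # (future A)
    b = pool.submit(B)           # (future B)
    c = C()                      # the parent evaluates C concurrently
    f = pool.submit(F, a, b)     # spawn F while a and b may be unresolved
    return c + f.result()


def strict(pool, A, B, C):
    a = pool.submit(A)
    b = pool.submit(B)
    c = C()
    # The parent stalls here until both arguments resolve, so the
    # invocation of F receives plain values rather than futures.
    f = pool.submit(F, touch(a), touch(b))
    return c + f.result()


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(nonstrict(pool, lambda: 2, lambda: 3, lambda: 10))  # prints 16
        print(strict(pool, lambda: 2, lambda: 3, lambda: 10))     # prints 16
```

Both versions compute the same value; they differ only in whether the task running `F` can start, and hold resources, before its arguments are available.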

Strict computations are also depth-first since requiring all data dependencies to go from a 
child thread to an ancestor prohibits any data dependency going from one subcomputation of 
a thread to another subcomputation of that thread. 

For strict multithreaded computations, once a thread $\Gamma$ has been spawned, a single processor
can complete the execution of $\Gamma$ and all of its descendant threads by using a depth-first schedule
even if no other progress is made on other parts of the computation. In other words, from the
time the thread $\Gamma$ is spawned until the time $\Gamma$ terminates, there is always at least one thread
from the subtree rooted at $\Gamma$ that is ready. This property allows us to derive algorithms to
schedule the execution of these computations with efficient use of both time and space.

We first show that for any strict multithreaded computation, there exists an execution
schedule that achieves linear speedup with linear expansion of space. We demonstrate such schedules
by exhibiting a completely synchronous scheduling algorithm that we call GDF (for global
depth-first). On a $P$-processor parallel computer, for any strict multithreaded computation with
work $T_1$, computation depth $T_\infty$, and activation depth $A = S_1$ possessing average available
parallelism $T_1/T_\infty = \Omega(P)$, algorithm GDF computes a schedule $\mathcal{X}$ such that $T_P(\mathcal{X}) = O(T_1/P)$
and $S_P(\mathcal{X}) = O(S_1 P)$. This algorithm uses a centralized priority queue that is shared by all
$P$ processors; hence, the synchronization cost of this algorithm makes it impractical for any
reasonably large number of processors.

By modifying GDF we can exhibit an algorithm that is efficient for moderately sized ma- 
chines. This algorithm, which we call GDF’, uses fewer accesses to the global queue while still 
computing an equally good schedule. 

To obtain an algorithm that is efficient for large machines, we use the technique of Karp and
Zhang [19] to replace the global priority queue with $P$ local queues, one for each processor. By
combining this technique with a new technique to throttle the execution and thereby maintain
a modest degree of synchrony among the processors, we obtain a randomized algorithm that
we call LDF (for local depth-first). For any strict multithreaded computation with $\lg P$
slack in its average available parallelism (that is, $T_1/T_\infty = \Omega(P \lg P)$), algorithm LDF
computes a schedule $\mathcal{X}$ with guaranteed space bound $S_P(\mathcal{X}) = O(S_1 P \lg P)$ and expected time
bound $E[T_P(\mathcal{X})] = O(T_1/P)$. This algorithm is simple and distributed (it requires no global
control nor any global data structures), and therefore, on a PRAM and certain low-latency,
high-bandwidth fixed-connection networks, the scheduling costs are no more than a constant
factor of the execution time.

Figure 5.2: (a) A nonstrict computation. The parent thread begins by spawning child threads
to evaluate expressions A and B. In parallel with the evaluation of A and B, the parent thread
can continue on to evaluate expression C. After evaluating C, the parent thread spawns a child
thread to evaluate the invocation (F a b). This spawn can occur even before expression A or B
has completed evaluation, in which case at least one of the corresponding identifiers, a or b, is
still a future and the spawn is nonstrict. (b) A strict version of the same computation. In this
case, the parent thread must stall at task w until both expressions A and B have completed
evaluation. Thus, the corresponding identifiers, a and b, are no longer futures when the spawn
occurs, and the spawn is strict.


Centralized scheduling algorithms 


Algorithm GDF maintains all active threads in a global queue prioritized by activation depth 
— the deepest threads get highest priority. At each step of the algorithm, the scheduler removes 
from the queue the P deepest ready threads (if there are fewer than P ready threads, it just 
removes them all) and assigns them arbitrarily to the P processors so that each processor 
receives at most one thread. Each processor that has an assigned thread then executes one task 
from that thread. To complete the step, all surviving threads and all newly spawned threads 


are placed back into the global queue. 
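
The step structure of GDF can be illustrated with a small simulation. The following Python sketch makes several simplifying assumptions that are not part of the algorithm's definition: the computation is a complete fork-join tree in which every thread has a unit-size activation frame, each non-leaf thread executes `fanout` spawn tasks followed by one final task, and the final task waits for all children to terminate (so every data dependency goes from a child to its parent). The names `Thread` and `gdf` are hypothetical.

```python
import heapq


class Thread:
    """A thread of a fork-join tree with unit-size activation frames."""

    def __init__(self, depth, parent, height, fanout):
        self.depth = depth                  # activation depth (root is 1)
        self.parent = parent
        self.spawns_left = fanout if depth < height else 0
        self.children_alive = 0

    def ready(self):
        # Spawn tasks are always ready; the final task is strict and
        # stalls until every child thread has terminated.
        return self.spawns_left > 0 or self.children_alive == 0


def gdf(P, height, fanout=2):
    """Simulate GDF; return (steps, tasks executed, peak active threads)."""
    active = [Thread(1, None, height, fanout)]   # the global queue
    steps = work = peak = 0
    while active:
        peak = max(peak, len(active))            # space at start of step
        ready = [t for t in active if t.ready()]
        # Remove the P deepest ready threads (all of them if fewer).
        chosen = heapq.nlargest(P, ready, key=lambda t: t.depth)
        for t in chosen:                         # one task per processor
            work += 1
            if t.spawns_left > 0:                # spawn task
                t.spawns_left -= 1
                t.children_alive += 1
                active.append(Thread(t.depth + 1, t, height, fanout))
            else:                                # final task: terminate
                active.remove(t)
                if t.parent is not None:
                    t.parent.children_alive -= 1
        steps += 1
    return steps, work, peak


if __name__ == "__main__":
    height = 5                                   # serial space S_1 = height
    t_inf = gdf(10**9, height)[0]                # unlimited processors
    for P in (1, 2, 4, 8):
        steps, work, peak = gdf(P, height)
        print(f"P={P}: steps={steps}, work={work}, peak space={peak}, "
              f"S_1*P={height * P}, T_1/P+T_inf={work / P + t_inf:.1f}")
```

Running the simulation with an effectively unlimited number of processors yields the computation depth of this particular tree, since every ready task then executes at the earliest possible step.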


Theorem 5 For any number $P$ of processors and any strict multithreaded computation with
work $T_1$, computation depth $T_\infty$, and activation depth $A = S_1$, algorithm GDF computes a
schedule $\mathcal{X}$ that achieves space $S_P(\mathcal{X}) \le S_1 P$ and time $T_P(\mathcal{X}) \le T_1/P + T_\infty$.


Proof: The time bound follows immediately from Theorem 1 since GDF always produces a
greedy schedule.

To prove the space bound, we show that the queue never contains more than $P$ threads
(ready or not) that span any activation depth. A thread $\Gamma$ spans an activation depth $d$ if $\Gamma$ has
activation depth $A(\Gamma) \ge d$, and either $\Gamma$ is the root or the parent thread $\Gamma'$ of $\Gamma$ has activation
depth $A(\Gamma') < d$. See Figure 5.3. For any time step $t$ during the execution and any activation
depth $d$, let $s(t,d)$ denote the number of active threads that span $d$ at the start of step $t$. Then
the total space $s(t)$ being used at the start of time step $t$ is

\[ s(t) = \sum_{d=1}^{S_1} s(t,d) . \tag{5.1} \]

By induction on the number of steps, we show that for all $t$, every activation depth $d$ has
$s(t,d) \le P$. With this bound, Equation (5.1) shows that $s(t) \le S_1 P$ for all time $t$, from which
the space bound follows.

The algorithm begins with just one active thread (the root), so for every activation depth
$d$, we have $s(1,d) \le 1 \le P$. Now consider any activation depth $d$, and suppose that for time
step $t$, the induction hypothesis $s(t,d) \le P$ holds. The computation being strict means that
for each of the $s(t,d)$ active threads that span $d$ at the start of step $t$, there is at least one
ready thread with activation depth greater than or equal to $d$. Recall that this is
the crucial property that we get by having all data dependencies go from a child thread to an
ancestor thread. Therefore, step $t$ begins with at least $s(t,d)$ ready threads at or deeper than $d$.
The depth-first ordering then ensures that no more than $P - s(t,d)$ threads with depth less
than $d$ can execute at step $t$. Then since the only way to increase the number of threads that
span $d$ is to execute a thread shallower than $d$ that spawns a child thread at or deeper than $d$,
step $t$ ends with at most $s(t,d) + (P - s(t,d)) = P$ active threads that span activation depth $d$.
Therefore, $s(t+1,d) \le P$, and the induction is complete. □


We can make this algorithm more efficient by reducing the number of accesses to the global 
queue as follows. The algorithm begins with the root thread assigned to some arbitrary pro- 
cessor and the global queue empty. In general, at the start of a step, some processors have an 
assigned thread and some don’t. Consider a step that begins with n processors that do not have 
a thread. In this case, to start the step, the scheduler removes from the queue the n deepest 
ready threads (if there are fewer than n ready threads, it just removes them all) and assigns 
them arbitrarily to the n processors so that each processor receives at most one thread. Each 
processor (now considering all $P$ of them) that has an assigned thread then executes one task
from that thread. Unless that thread spawns, terminates, or stalls, the processor can keep its
thread so it will have a thread to start the next step. If the thread stalls, the processor must
return it to the global queue, and consequently, the processor will not have a thread to start
the next step. Similarly, if the thread terminates, the processor will not have a thread to start
the next step. Lastly, if the thread spawns, the processor returns the parent thread (the one it
was working on) to the global queue and keeps the child thread, and therefore, in this case, the
processor will still have a thread to start the next step.

Figure 5.3: The activation tree corresponding to the example computation of Figure 2.1. Each
black node corresponds to a thread and the edges correspond to the spawn edges. Associated
with each thread is an activation frame depicted by the grey rectangles drawn with height
equal to the size of the frame. Notice that the activation frames are located so that the top of
a thread's frame is just below the bottom of its parent's frame. In this way each thread's black
node is drawn at its activation depth (depth increases in the downward direction). The threads
that span activation depth $d$ are indicated by highlighting the activation frame's border.

This version of the algorithm, which we call GDF’, achieves the same performance bounds 
as proved in Theorem 5, but requires access to the global queue only when threads spawn, 


terminate, or stall. 
Theorem 6 For any number $P$ of processors and any strict multithreaded computation with
work $T_1$, computation depth $T_\infty$, and activation depth $A = S_1$, algorithm GDF' computes a
schedule $\mathcal{X}$ that achieves space $S_P(\mathcal{X}) \le S_1 P$ and time $T_P(\mathcal{X}) \le T_1/P + T_\infty$.


Proof: This proof follows the proof of Theorem 5, but we add the following assertion to the
induction hypothesis: For any activation depth $d$, if a step $t$ begins with $s(t,d) \le P$ active
threads that span depth $d$, then step $t$ begins with no more than $P - s(t,d)$ processors that
have a thread with activation depth less than $d$. If this assertion is true at the start of step $t$,
then at least $s(t,d)$ processors get assigned to threads at or deeper than $d$, and step $t+1$ begins
with $s(t+1,d) \le P$ active threads spanning $d$. Also, since no more than $P - s(t,d)$ processors
work on threads shallower than $d$ during step $t$, step $t+1$ begins with no more than $P - s(t,d)$
processors that have a thread shallower than $d$. We consider two cases based on the relative sizes
of $s(t,d)$ and $s(t+1,d)$. If $s(t+1,d) \le s(t,d)$, then $P - s(t,d) \le P - s(t+1,d)$, and hence, step
$t+1$ begins with no more than $P - s(t+1,d)$ processors that have a thread with activation depth
less than $d$. On the other hand, if $s(t+1,d) > s(t,d)$, then $s(t+1,d) - s(t,d)$ processors must
have executed a thread less deep than $d$ that spawned a child thread at or deeper than $d$ during
step $t$. Each of these processors only keeps the thread with depth greater than or equal to $d$, and
consequently, step $t+1$ begins with no more than $P - s(t,d) - (s(t+1,d) - s(t,d)) = P - s(t+1,d)$
processors that have a thread with activation depth less than $d$. In either case, step $t+1$ begins
with no more than $P - s(t+1,d)$ processors having a thread less deep than $d$, thereby completing
the induction. □


This algorithm may be feasible for a modest number of processors, but for a large number 
of processors, the cost of synchronization at the global queue becomes prohibitive. To derive a 
truly scalable and distributed algorithm, we need to split the global queue into P local queues 


— one for each processor. 




Chapter 6 


Distributed scheduling algorithms 


In a distributed scheduling algorithm, each processor works depth-first out of its own local 
priority queue. Specifically, to get a thread to work on, a processor removes the deepest ready 
thread from its local queue. Ideally, we would like the processor to then continue working on 
that thread until it either stalls, terminates, or spawns, and when the processor does need to 
enqueue a thread (as in the case when the thread stalls or spawns) or dequeue a new thread, 
it does so by accessing only its local queue. Of course, this approach could result in processors 
with empty queues sitting idle while other processors have large queues. Thus, we require 
each processor to have some access to non-local queues in order to facilitate some type of load 
balancing. 

The technique of Karp and Zhang [19] suggests a randomized algorithm in which threads 
are located in random queues in order to achieve some balance. At the end of this chapter, we 
show that the naive adoption of this technique does not work. In order to achieve the desired 
result, we modify the Karp and Zhang technique by incorporating a new mechanism to enforce 
a modest degree of synchrony among the processors. 

Algorithm LDF operates in iterations with each iteration consisting of a synchronization 
phase followed by a computation phase and ending with a communication phase. In a syn- 
chronization phase, we compute a cutoff depth D that is a global value made available to all 
processors. During the following computation phase, only those threads with activation depth 
greater than or equal to D can execute. Finally, the communication phase redistributes threads 


to random locations. 
The operation of each phase is governed by a synchronization parameter $r$ that affects both
the time and space performance of the algorithm. Let LDF($r$) denote Algorithm LDF with
synchronization parameter $r$.

In a synchronization phase of LDF(r), we use the synchronization parameter r to compute 
the cutoff depth D. Each processor $p_i$, for $i = 1, \ldots, P$, computes the activation depth $d_i$ of its
rth deepest ready thread. In other words, $d_i$ is the activation depth for which processor $p_i$ has
fewer than r ready threads deeper than $d_i$ but at least r ready threads at or deeper than $d_i$.
Cutoff depth D is then computed simply by

\[ D = \max_{1 \le i \le P} d_i \]

as illustrated in Figure 6.1.
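
The synchronization phase can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `rth_deepest_depth` and `cutoff_depth` are hypothetical, each local queue is modeled as a plain list of activation depths, and a queue holding fewer than r ready threads is assumed to contribute depth 0 (a case the text does not address).

```python
def rth_deepest_depth(depths, r):
    """Depth d_i of the r-th deepest ready thread in one local queue.

    `depths` lists the activation depth of every ready thread in the queue.
    Assumption of this sketch: a queue with fewer than r ready threads
    contributes depth 0.
    """
    if len(depths) < r:
        return 0
    return sorted(depths, reverse=True)[r - 1]


def cutoff_depth(queues, r):
    """Cutoff depth D, the maximum of d_i over all processors."""
    return max(rth_deepest_depth(q, r) for q in queues)


# Three processors, synchronization parameter r = 2.
queues = [[5, 5, 4, 2], [6, 3, 3, 1], [4, 4, 4, 4]]
print(cutoff_depth(queues, 2))  # 5
```

With r = 2 the per-processor depths are 5, 3, and 4, so only threads at depth 5 or deeper may execute in the following computation phase.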

During the computation phase of LDF(r), each processor executes one task from each ready 
thread with activation depth greater than or equal to the cutoff depth D in its local queue. We 
further forbid each processor from executing more than r spawns; if a processor has more than 
r threads at or deeper than D that want to spawn, it may only execute r of them. 

The iteration ends with a communication phase during which each processor must move 
each ready thread with activation depth greater than or equal to D (as determined at the 
beginning of the iteration) and each newly spawned thread from its local queue to a queue 
selected uniformly at random, independently for each thread. 
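
The computation and communication phases described above can be sketched as follows. This is a simplified model, not the algorithm as implemented anywhere: the `Thread` class, its `plan` list of pending actions, and the function `run_phases` are all hypothetical, stalls and data dependencies are ignored, and each spawned child is assumed to consist of a single plain task.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Thread:
    depth: int
    # Remaining actions, next one first: "t" is a plain task, "s" is a spawn.
    plan: list = field(default_factory=lambda: ["t"])


def run_phases(queues, D, r, rng):
    """Computation and communication phases for a given cutoff depth D."""
    movers = []                          # threads that must be re-randomized
    for q in queues:
        spawns, stay = 0, []
        for t in q:
            if t.depth < D:              # shallower than the cutoff: untouched
                stay.append(t)
                continue
            if t.plan[0] == "s":
                if spawns == r:          # spawn budget exhausted: thread waits
                    movers.append(t)
                    continue
                spawns += 1
                movers.append(Thread(t.depth + 1))  # child (assumed one task)
            t.plan.pop(0)                # one task of this thread has executed
            if t.plan:                   # unfinished threads remain active
                movers.append(t)
        q[:] = stay
    for t in movers:                     # communication phase
        queues[rng.randrange(len(queues))].append(t)


queues = [[Thread(3, ["s"]), Thread(3, ["s"]), Thread(3, ["s"]), Thread(1)], []]
run_phases(queues, D=3, r=2, rng=random.Random(0))
print(sorted(t.depth for q in queues for t in q))
```

In this run two of the three depth-3 threads spawn and terminate, the third is held back by the limit of r = 2 spawns, and the depth-1 thread never moves because it is shallower than the cutoff.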

By using the synchronization parameter r to compute the cutoff depth and then ensuring
that each processor executes only tasks from threads at or deeper than the cutoff depth while
allowing at most r spawns, we get a guaranteed space bound.

Lemma 7 For any number P of processors and any strict multithreaded computation with
activation depth $A = S_1$, Algorithm LDF(r) computes a schedule $\mathcal{X}$ such that $S_P(\mathcal{X}) \le 2rS_1P$.


Figure 6.1: Computing the cutoff depth. Each column represents the local priority queue of a processor,
and each row represents an activation depth with depth increasing in the downward direction. We depict
each thread by a circle located at its activation depth. The ready threads in each queue are ordered by
activation depth with ties broken arbitrarily. In this example, the synchronization parameter r = 12,
and the rth deepest ready thread for each processor is shown in black. The deepest of these black threads
determines the cutoff depth. Only the ready threads at or deeper than the cutoff depth (those in the
shaded region) can execute during the following computation phase.

Proof: We show by induction on the number of iterations that no activation depth ever has
more than 2rP active threads that span it. Specifically, recalling the notation used in the proof
of Theorem 5, we show that for every activation depth d and every iteration t of the execution,
$s(t,d) \le 2rP$. The result then follows from Equation (5.1). As before, the base case is obvious.

For any activation depth d and any iteration t of the execution, we consider 2 cases. In
the first case, suppose iteration t begins with $rP \le s(t,d) \le 2rP$ active threads spanning
depth d. Due to the strictness of the computation, there must be at least rP ready threads
with activation depth greater than or equal to d, and by pigeon-holing, some processor’s local
queue must have at least r of them. Therefore, the cutoff depth D will be set with $D \ge d$.
Consequently, during the computation phase of iteration t, no thread with activation depth
less than d can execute and the iteration ends with no more active threads spanning depth d
than it started with. Now suppose iteration t begins with $s(t,d) < rP$ active threads spanning
depth d. In this case, during the computation phase, since each processor is only allowed r
spawns, the number of active threads that span depth d can increase by at most rP, and
therefore, the iteration ends with no more than 2rP active threads spanning depth d. In either
case, $s(t+1,d) \le 2rP$, which completes the induction. ∎


In order to achieve speedup in the execution time, we must ensure that during the computation
phase of each iteration, each processor has some ready threads at or deeper than the cutoff
depth. To ensure that the cutoff depth is not set too deep, we must use a large enough synchronization
parameter r. On the other hand, the space bound of Lemma 7 is directly proportional
to r. By setting $r = 6 \lg P$, the space bound of Lemma 7 becomes $S_P(\mathcal{X}) \le 12 S_1 P \lg P$, and
with high probability, most computation phases take $O(\lg P)$ time and get at least $P \lg P$ tasks
executed as we now show.

To analyze the running time, we say that each iteration either succeeds or fails depending 
on how many tasks execute. An iteration that begins with at least Plg P ready threads fails if 
fewer than P lg P of the ready threads get a task executed. An iteration that begins with fewer 
than Plg P ready threads fails if not all of them get a task executed. 

We now show that with the synchronization parameter set to $r = 6 \lg P$, each iteration fails
with probability no more than $P^{-5}$.

Lemma 8 For any number P of processors, an iteration of Algorithm LDF(6 lg P) fails with
probability no more than $P^{-5}$.


Proof: Consider an iteration that begins with at least Plg P ready threads, and suppose that 
when two threads have the same activation depth, we give each thread a unique identifier to 
break the tie so we can uniquely identify the Plg P deepest ready threads. If no local queue 
contains more than 6lg P of the Plg P deepest ready threads, then the synchronization phase 
sets the cutoff depth so that all Plg P of these deepest threads are at or deeper than the cutoff 
depth. Therefore, an iteration that begins with at least Plg P ready threads succeeds if no 
local queue contains more than 6lg P of the Plg P deepest ready threads. 


Consider a particular processor $p_i$ and let the random variable $Z_i$ denote how many of the
$P \lg P$ deepest ready threads start the iteration in the local queue of processor $p_i$. Each thread
is located independently at random, hence, the random variable $Z_i$ has a binomial distribution
with $P \lg P$ trials and success probability $1/P$. Therefore,

\[ \Pr\{Z_i > 6 \lg P\} \le \binom{P \lg P}{6 \lg P} \left(\frac{1}{P}\right)^{6 \lg P} . \]

Then from the bound

\[ \binom{n}{k} \le \left(\frac{en}{k}\right)^{k} \tag{6.1} \]

and the fact that $6 > 2e$, we can upper bound $\Pr\{Z_i > 6 \lg P\}$ by

\[ \Pr\{Z_i > 6 \lg P\} \le \left(\frac{eP \lg P}{6 \lg P}\right)^{6 \lg P} \left(\frac{1}{P}\right)^{6 \lg P} = \left(\frac{e}{6}\right)^{6 \lg P} < P^{-6} . \]

Now let $Z = \max_{1 \le i \le P} Z_i$. For an iteration that begins with at least $P \lg P$ ready threads, the
probability of failure is no more than $\Pr\{Z > 6 \lg P\}$. We can use Boole’s Inequality to upper
bound $\Pr\{Z > 6 \lg P\}$ by adding the individual probabilities, from which,

\[ \Pr\{Z > 6 \lg P\} \le P \cdot \Pr\{Z_i > 6 \lg P\} < P^{-5} . \]

For the case of an iteration that begins with fewer than $P \lg P$ ready threads, the failure
probability is still upper bounded by $\Pr\{Z > 6 \lg P\}$ where the random variable Z has the
distribution just described. ∎
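
As a sanity check on the constants in this proof, the binomial tail can be evaluated exactly for small machines. The sketch below is illustrative only: the function names are hypothetical, P is assumed to be a power of two so that lg P is an integer, and the check uses the slightly larger event $Z_i \ge 6 \lg P$, which the same binomial-coefficient bound also covers.

```python
from fractions import Fraction
from math import comb, log2


def tail_at_least(n, p, k):
    """Exact Pr{Z >= k} for Z ~ Binomial(n, p), where p is a Fraction."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def failure_bound_holds(P):
    """Check Pr{Z_i >= 6 lg P} < P^-6 and the union bound P * tail < P^-5."""
    lg = int(log2(P))                    # assumes P is a power of two
    tail = tail_at_least(P * lg, Fraction(1, P), 6 * lg)
    return tail < Fraction(1, P**6) and P * tail < Fraction(1, P**5)


for P in (16, 64):
    print(P, failure_bound_holds(P))
```

Exact rational arithmetic is used so that the comparison against $P^{-6}$ is not affected by floating-point underflow.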


We now show that iterations fail independently of each other. Specifically, we show that 
knowing whether an iteration t fails provides no information about whether any future iteration
fails. The failure of an iteration depends only on how the ready threads are distributed among
the processors. Therefore, we need to show that knowing whether iteration t fails provides no
information about the distribution of threads at the end of the iteration. Suppose iteration t has
cutoff depth D. No matter if iteration t fails or not, the iteration ends with a communication
phase in which every ready thread at or deeper than D gets moved to a random location. Thus, 
iteration t provides no information about the distribution of threads at or deeper than the cutoff 
depth. Now consider the threads less deep than D. The only part of an iteration that even 
considers the threads shallower than the cutoff depth is the synchronization phase. Therefore, 
we need to show that computing the cutoff depth provides no information about the distribution 
of threads with activation depth less than D. Consider an alternative method for computing 
the cutoff depth. Let all the processors work in synch from the bottom up. First each processor 
counts the number of ready threads it has with activation depth $S_1$. Then each processor adds
on the number of ready threads it has with activation depth $S_1 - 1$. We continue in this manner
until some processor reaches a count of r (the synchronization parameter). At this depth we 
stop and set the cutoff depth. In this way the synchronization phase can compute the cutoff 
depth with the exact same result but without ever considering threads shallower than D. Thus, 
computing the cutoff depth provides no information about the distribution of threads shallower 
than the cutoff depth. 
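
The alternative bottom-up computation can be sketched and compared against the direct definition. The code below is a minimal illustration with hypothetical names; queues are lists of activation depths in the range 1 to `max_depth`, and, as an assumption of this sketch, both functions return 0 when no processor holds r ready threads.

```python
def cutoff_direct(queues, r):
    """D as the maximum over processors of the r-th deepest ready depth."""
    def d_i(q):
        ds = sorted(q, reverse=True)
        return ds[r - 1] if len(ds) >= r else 0
    return max(d_i(q) for q in queues)


def cutoff_bottom_up(queues, r, max_depth):
    """Scan depths from the deepest upward, all processors in step.

    Each processor keeps a running count of its ready threads seen so far.
    The scan stops at the first depth where some count reaches r, so threads
    shallower than the returned depth are never examined.
    """
    counts = [0] * len(queues)
    for depth in range(max_depth, 0, -1):
        for i, q in enumerate(queues):
            counts[i] += q.count(depth)
        if max(counts) >= r:
            return depth
    return 0


queues = [[5, 5, 4, 2], [6, 3, 3, 1], [4, 4, 4, 4]]
for r in range(1, 6):
    print(r, cutoff_direct(queues, r), cutoff_bottom_up(queues, r, 6))
```

The two agree because both return the largest depth d for which some processor has at least r ready threads at or deeper than d.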

With iterations failing independently of each other, we can bound the number of failed
iterations, thereby bounding the total number of iterations taken.

Lemma 9 For any number P of processors and any strict multithreaded computation with
work $T_1$ and computation depth $T_\infty$, for any $\epsilon > 0$, with probability at least $1 - \epsilon$, Algorithm
LDF(6 lg P) computes a schedule $\mathcal{X}$ that takes $O(T_1/(P \lg P) + T_\infty + \log_P(1/\epsilon))$ iterations.


Proof: First we consider the failed iterations. Let the random variable f denote the number
of failed iterations. We will show that for any $\epsilon > 0$, the probability that $f > eT_1/(P \lg P) + b$
is no more than $\epsilon$ when $b = \frac{1}{3} \log_P(1/\epsilon)$. There are at most $T_1$ iterations since each iteration
always results in at least one task being executed, and each iteration fails independently with
probability at most $P^{-5}$. Therefore, f is bounded by a binomial distribution with $T_1$ trials and success
probability $P^{-5}$, from which

\[ \Pr\left\{ f > \frac{eT_1}{P \lg P} + b \right\} \le \binom{T_1}{\frac{eT_1}{P \lg P} + b} \left(\frac{1}{P^5}\right)^{\frac{eT_1}{P \lg P} + b} . \]

Then using Inequality (6.1) we get

\[ \Pr\left\{ f > \frac{eT_1}{P \lg P} + b \right\} \le \left( \frac{eT_1}{\left(\frac{eT_1}{P \lg P} + b\right) P^5} \right)^{\frac{eT_1}{P \lg P} + b} \le \left( \frac{\lg P}{P^4} \right)^{\frac{eT_1}{P \lg P} + b} \le P^{-3b} , \]

and $P^{-3b} = \epsilon$ for $b = \frac{1}{3} \log_P(1/\epsilon)$. Thus, with probability at least $1 - \epsilon$, $f = O(T_1/(P \lg P) + \log_P(1/\epsilon))$.

Now consider the successful iterations. A successful iteration that begins with at least 
Plg P ready threads, executes a task from at least Plg P of them, and a successful iteration 
that begins with fewer than Plg P ready threads, executes a task from every ready thread. 
Therefore, we can think of each successful iteration as a step in a greedy schedule with Plg P 
processors. Then, as in the proof of Theorem 1, we know that there can be no more than
$T_1/(P \lg P) + T_\infty$ successful iterations.

Adding together the number of successful iterations and the number of failed iterations
completes the proof. ∎


Now if we let the random variable $X_i$ denote the time taken by the ith computation phase
of Algorithm LDF(6 lg P), we can give the total time in computation phases as the random
variable $X = X_1 + X_2 + \cdots + X_Y$ where Y is the random variable denoting the number of
iterations. The time taken by the ith computation phase is proportional to the maximum
number of ready threads with activation depth greater than or equal to the cutoff depth in
any processor. There can be a total of at most $18P \lg P$ ready threads at or deeper than the
cutoff depth: $rP = 6P \lg P$ deeper than the cutoff depth and $12P \lg P$ at the cutoff depth (from
Lemma 7 with synchronization parameter $r = 6 \lg P$). Each of these threads is located
independently at random. Thus, we can bound each $X_i$ as the size of the largest bin when
throwing $18P \lg P$ balls at random into P bins. Furthermore, by the independence argument,
the $X_i$’s are independent. We can now bound the random variable X.
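
The balls-and-bins quantity that bounds each computation phase is easy to sample. The following sketch is an illustration only (the function name, the choice P = 64, and the fixed seed are assumptions made here); it throws $18P \lg P$ balls into P bins and reports the fullest bin as a multiple of lg P.

```python
import random
from collections import Counter
from math import log2


def fullest_bin(num_balls, num_bins, rng):
    """Throw balls independently and uniformly; return the largest bin size."""
    loads = Counter(rng.randrange(num_bins) for _ in range(num_balls))
    return max(loads.values())


P = 64
balls = 18 * P * int(log2(P))            # 18 P lg P, with P a power of two
rng = random.Random(1)                   # fixed seed, for reproducibility only
samples = [fullest_bin(balls, P, rng) for _ in range(20)]
print(min(samples) / log2(P), max(samples) / log2(P))
```

By the pigeonhole principle every sample is at least 18 lg P; the interest is in how far above that average the fullest bin lands.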


Lemma 10 Let the random variable X denote the sum of Y mutually independent random
variables, $X = X_1 + X_2 + \cdots + X_Y$ with each $X_i$, for $i = 1, \ldots, Y$, distributed as the number
of balls in the fullest bin when throwing $P \ln P$ balls independently at random into $P \ge 2$ bins.
Then for any $\epsilon > 0$, we have $X = O(Y \ln P + \lg(1/\epsilon))$ with probability at least $1 - \epsilon$.


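Proof: We have

\[ \Pr\{X > aY \ln P + b\} = \Pr\left\{ e^{X/e} > e^{(aY \ln P + b)/e} \right\} \le E\left[e^{X/e}\right] e^{-(aY \ln P + b)/e} \tag{6.2} \]

by Markov’s inequality. By the independence of the $X_i$’s,

\[ E\left[e^{X/e}\right] = \prod_{i=1}^{Y} E\left[e^{X_i/e}\right] . \tag{6.3} \]

From the definition of expectation,

\[ E\left[e^{X_i/e}\right] = \sum_{j=\ln P}^{P \ln P} \Pr\{X_i = j\}\, e^{j/e} . \]

To bound $E[e^{X_i/e}]$, we break this sum into pieces. First we break out the terms from $j = \ln P$
to $j = e^3 \ln P - 1$, which yields

\[ E\left[e^{X_i/e}\right] = \sum_{j=\ln P}^{e^3 \ln P - 1} \Pr\{X_i = j\}\, e^{j/e} + \sum_{j=e^3 \ln P}^{P \ln P} \Pr\{X_i = j\}\, e^{j/e} . \tag{6.4} \]

The first of these sums we bound by factoring out the largest term and upper-bounding the
sum of probabilities by 1:

\[
\begin{aligned}
\sum_{j=\ln P}^{e^3 \ln P - 1} \Pr\{X_i = j\}\, e^{j/e}
&\le \sum_{j=\ln P}^{e^3 \ln P - 1} \Pr\{X_i = j\}\, e^{e^2 \ln P} \\
&= e^{e^2 \ln P} \sum_{j=\ln P}^{e^3 \ln P - 1} \Pr\{X_i = j\} \\
&\le e^{e^2 \ln P} .
\end{aligned}
\tag{6.5}
\]

To bound the second sum in Equation (6.4), we further break the range of the index variable
j into smaller pieces indexed by $k = 3, \ldots, \lceil \ln P \rceil - 1$, with piece k going from $j = e^k \ln P$ to
$j = e^{k+1} \ln P - 1$:

\[
\begin{aligned}
\sum_{j=e^3 \ln P}^{P \ln P} \Pr\{X_i = j\}\, e^{j/e}
&= \sum_{k=3}^{\lceil \ln P \rceil - 1} \ \sum_{j=e^k \ln P}^{e^{k+1} \ln P - 1} \Pr\{X_i = j\}\, e^{j/e} \\
&\le \sum_{k=3}^{\lceil \ln P \rceil - 1} e^{e^k \ln P} \sum_{j=e^k \ln P}^{e^{k+1} \ln P - 1} \Pr\{X_i = j\} \\
&\le \sum_{k=3}^{\lceil \ln P \rceil - 1} e^{e^k \ln P}\, \Pr\{X_i \ge e^k \ln P\} \\
&= \sum_{k=3}^{\lceil \ln P \rceil - 1} P^{e^k}\, \Pr\{X_i \ge e^k \ln P\} .
\end{aligned}
\tag{6.6}
\]

Now we can bound $\Pr\{X_i \ge e^k \ln P\}$ by the same technique as in Lemma 8, since $X_i$ has the
same distribution as the random variable Z considered in the proof of Lemma 8:

\[
\begin{aligned}
\Pr\{X_i \ge e^k \ln P\}
&\le P \binom{P \ln P}{e^k \ln P} \left(\frac{1}{P}\right)^{e^k \ln P} \\
&\le P\, e^{-(k-1) e^k \ln P} \\
&= P^{1 - (k-1) e^k} .
\end{aligned}
\]

Substituting this bound into Inequality (6.6) yields

\[
\begin{aligned}
\sum_{j=e^3 \ln P}^{P \ln P} \Pr\{X_i = j\}\, e^{j/e}
&\le \sum_{k=3}^{\lceil \ln P \rceil - 1} P^{e^k}\, P^{1 - (k-1) e^k} \\
&\le \sum_{k=3}^{\infty} P^{1 - (k-2) e^k} \\
&\le 1 ,
\end{aligned}
\tag{6.7}
\]

since the sum is bounded by the geometric sum $\sum_{k=1}^{\infty} 2^{-k} = 1$. Now, we can substitute
Inequality (6.5) and Inequality (6.7) back into Equation (6.4), producing

\[ E\left[e^{X_i/e}\right] \le e^{e^2 \ln P} + 1 \le e^{(e^2 + 1) \ln P} . \]

Substituting this bound into Equation (6.3) and then substituting into Inequality (6.2), we
obtain

\[
\begin{aligned}
\Pr\{X > aY \ln P + b\}
&\le e^{(e^2 + 1) Y \ln P}\, e^{-(aY \ln P + b)/e} \\
&= \exp\left( -\left( \frac{a}{e} - e^2 - 1 \right) Y \ln P - \frac{b}{e} \right) \\
&\le \exp\left( -\frac{b}{e} \right)
\end{aligned}
\]

for $a \ge e^3 + e$. Thus, with $b = e \ln(1/\epsilon)$, we obtain

\[ \Pr\{X > (e^3 + e)\, Y \ln P + e \ln(1/\epsilon)\} \le \epsilon . \]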
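
Two of the elementary steps in this proof can be spot-checked numerically: that each term $P^{1-(k-2)e^k}$ is dominated by $2^{-k}$, and that $e^{e^2 \ln P} + 1 \le e^{(e^2+1) \ln P}$. The sketch below uses hypothetical function names and compares exponents rather than the terms themselves, since the terms underflow in floating point.

```python
from math import e, exp, log2


def term_exponent(k):
    """Exponent of P in the k-th term P^(1 - (k - 2) e^k) of the sum in (6.7)."""
    return 1 - (k - 2) * exp(k)


def dominated_by_geometric(P, k_max=40):
    """True if P^(1 - (k-2) e^k) <= 2^-k for every k = 3, ..., k_max."""
    return all(term_exponent(k) * log2(P) <= -k for k in range(3, k_max + 1))


def combine_step_holds(P):
    """True if P^(e^2) + 1 <= P^(e^2 + 1), the last step bounding E[e^(X_i/e)]."""
    return P ** (e * e) + 1 <= P ** (e * e + 1)


for P in (2, 16, 1024):
    print(P, dominated_by_geometric(P), combine_step_holds(P))
```

Both checks rely on $P \ge 2$, which is the hypothesis of Lemma 10.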


We can now characterize the time and space usage for execution schedules computed by the
LDF algorithm with synchronization parameter $r = 6 \lg P$.

Theorem 11 For any number $P \ge 2$ of processors and any strict multithreaded computation
with work $T_1$, computation depth $T_\infty$, and activation depth $A = S_1$, Algorithm LDF(6 lg P)
computes a schedule $\mathcal{X}$ that uses space $S_P(\mathcal{X}) = O(S_1 P \lg P)$, and for any $\epsilon > 0$, with probability
at least $1 - \epsilon$, the schedule uses time $T_P(\mathcal{X}) = O(T_1/P + T_\infty \lg P + \lg(1/\epsilon))$.


Proof: The space bound follows directly from Lemma 7 with synchronization parameter $r = 6 \lg P$.
The time $T_P(\mathcal{X})$ is the total time taken in computation phases. Let the random variable
Y denote the number of iterations. Then we can decompose $T_P(\mathcal{X})$ as a sum of Y mutually
independent random variables, $T_P(\mathcal{X}) = X_1 + X_2 + \cdots + X_Y$ with each $X_i$ distributed as
the size of the fullest bin when throwing $18P \lg P$ balls independently at random into P bins.
Using $\epsilon/2$ as the value of $\epsilon$ in Lemma 9, we obtain $Y = O(T_1/(P \lg P) + T_\infty + \log_P(1/\epsilon))$ with
probability at least $1 - \epsilon/2$. Then, using $\epsilon/2$ as the value of $\epsilon$ in Lemma 10, we obtain
$T_P(\mathcal{X}) = O(Y \lg P + \lg(1/\epsilon))$ with probability at least $1 - \epsilon/2$ (using $18P \lg P$ instead of $P \ln P$ only
affects the constant). Thus, with probability at least $1 - \epsilon$, the total time taken in computation
phases is $T_P(\mathcal{X}) = O(T_1/P + T_\infty \lg P + \lg(1/\epsilon))$. ∎


Corollary 12 For any number $P \ge 2$ of processors and any strict multithreaded computation
with work $T_1$ and computation depth $T_\infty$, Algorithm LDF(6 lg P) computes a schedule $\mathcal{X}$ with
expected execution time $E[T_P(\mathcal{X})] = O(T_1/P + T_\infty \lg P)$.

Proof: Just use $\epsilon = 1/P$ in Theorem 11 to get $T_P(\mathcal{X}) = O(T_1/P + T_\infty \lg P)$ with probability
at least $1 - 1/P$. Then

\[ E[T_P(\mathcal{X})] \le \left(1 - \frac{1}{P}\right) O\left(\frac{T_1}{P} + T_\infty \lg P\right) + \frac{1}{P}\, T_1 . \]


The LDF(6 lg P) algorithm achieves linear expected speedup when the computation has
average available parallelism $T_1/T_\infty = \Omega(P \lg P)$.

We can view the lg P factors in the space bound and the average available parallelism
required to achieve linear speedup as the computational slack required by Valiant’s bulk-synchronous
model [33]. The space bound $S_P(\mathcal{X}) = O(S_1 P \lg P)$ indicates that Algorithm
LDF(6 lg P) requires memory to scale sufficiently to allow each physical processor enough
space to simulate lg P virtual processors. Given this much space, the time bound
$E[T_P(\mathcal{X})] = O(T_1/P + T_\infty \lg P)$ then demonstrates linear expected speedup provided the computation has
lg P slack in the average available parallelism.


Practical considerations 


In many models of parallel computation, the queueing, synchronization, and communication 
costs for Algorithm LDF(6lg P) are only a constant fraction of the execution time. If a global 
max across the P processors can be accomplished in O(lg P) time, then each synchronization 
phase takes only O(lg P) time, and since each computation phase takes $\Omega(\lg P)$ time, the
synchronization phases take at most a constant fraction of the total time. To ensure that the
communication costs make up only a constant fraction of the total time, each processor must
be able to send $w = \Omega(\lg P)$ threads to random processors in O(w) time. For each thread, the
communication may involve sending just a word or two of thread description, or it may involve
sending the entire activation frame. When the amount of information that needs to be sent with 
each thread is just some constant amount, then these requirements are met by a hypercube or 
indirect butterfly using Ranade’s algorithm [28] to do the routing. 

In order to facilitate the analysis of the LDF algorithm, we had to use a rather large synchronization
parameter, but in practice, we expect that Algorithm LDF can be implemented
with significantly smaller values of r and a small constant in the expected time bound of Corollary 12.
With greater care in the analysis, the synchronization parameter can be reduced from
6 lg P to 4 lg P. This reduction in r reduces the space bound of Theorem 11 from $12 S_1 P \lg P$
to $8 S_1 P \lg P$. The constant hidden in the expected time bound of Corollary 12 then works out
to be slightly less than 69; as the number P of processors increases, however, this constant
approaches 34. These constants are, of course, artifacts of the analysis. A proper value for
the synchronization parameter should be determined empirically. With fairly large machines,
values of r much smaller than 4 lg P should work to yield small constants in both the space and
expected time bounds.

If implemented, LDF(r) can be modified to allow more asynchrony in the execution, require
less thread migration, and take better advantage of specific processor architectures. In
particular, during an iteration, each processor can work on threads in any way it desires so long
as it obeys the following rules.

1. Only threads at or deeper than the cutoff depth may execute.
2. Only r spawns may be performed.
3. Each thread at or deeper than the cutoff depth must finish the iteration at a random location.


With these rules, an iteration can continue for an arbitrarily long time. The computation 
phase only has to end when a constant fraction of the processors no longer have work to do. 
In the case of a computation phase in which more than a constant fraction of the processors 
go idle before lg P steps, the phase cannot end until each of the other processors has executed 
at least one task from each of its threads at or deeper than the cutoff depth (modulo rule 2). 
Once enough processors go idle, the communication phase begins to ensure that each processor 
observes rule 3 (and this last provision if necessary), and then the iteration ends. During the 
computation phase, some processors may want to interleave the execution of multiple threads 


while others may prefer long runs with a single thread. 


Bounding individual processor storage requirements 


The space bound of Theorem 11 is an aggregate bound, but in a distributed memory machine,
we may want to bound the space associated with each individual processor’s queue. In the LDF
algorithm, each active thread is located in the local queue of a processor chosen at random,
so we assume that each activation frame is located in the local memory of the same randomly
chosen processor as its associated active thread. Since Lemma 7 shows that the aggregate space
used by Algorithm LDF(r) is bounded by $2rS_1P$, we would like some way to ensure that each
individual processor requires space bounded by $O(rS_1)$.

Since activation frames are located in randomly chosen processors, we can show that at
any given iteration, the expected space needed by any given processor is no more than $2rS_1$.
Suppose that at some iteration t, there are k active threads with frame sizes $F_1, F_2, \ldots, F_k$.
Consider a particular processor p, and let the random variable W denote the total space being
used by activation frames located in the memory of processor p. We can decompose W as the
weighted sum of mutually independent indicator random variables:

\[ W = F_1 W_1 + F_2 W_2 + \cdots + F_k W_k , \]

where the random variable $W_i$ indicates whether the ith active thread is located at processor p.
Since each active thread is located at a processor chosen uniformly at random, the expected
value of $W_i$ is given by $E[W_i] = \Pr\{W_i = 1\} = 1/P$. Then we can bound the expected value
of W by

\[
\begin{aligned}
E[W] &= F_1 E[W_1] + F_2 E[W_2] + \cdots + F_k E[W_k] \\
&= \frac{1}{P} \sum_{i=1}^{k} F_i \\
&\le \frac{1}{P} \cdot 2rS_1P \\
&= 2rS_1 ,
\end{aligned}
\]

since the sum of the frame sizes is bounded in Lemma 7 by $2rS_1P$.
To show that for any given iteration t, with high probability, no processor uses more than
$O(rS_1)$ space for activation frame storage, we use the following result due to Raghavan [27,
Theorem 1].

Lemma 13 (Raghavan) Let $a_1, a_2, \ldots, a_k$ be reals in $(0,1]$. Let $\psi_1, \psi_2, \ldots, \psi_k$ be independent
Bernoulli trials, and let $\Psi = \sum_{i=1}^{k} a_i \psi_i$. Then for any $\delta > 0$,

\[ \Pr\{\Psi > (1 + \delta) E[\Psi]\} < \left( \frac{e^{\delta}}{(1 + \delta)^{(1 + \delta)}} \right)^{E[\Psi]} . \]

Setting $a_i = F_i/S_1$, we can bound $\Pr\{W > 2erS_1\}$ by applying Lemma 13 to the random
variable $\Psi = W/S_1$ with $E[\Psi] = 2r$. Then

\[ \Pr\{W > 2erS_1\} \le \Pr\{\Psi > e E[\Psi]\} < \left( \frac{e^{e-1}}{e^{e}} \right)^{E[\Psi]} = e^{-2r} . \]

If the synchronization parameter is set with $r = r' \ln P$ where $r' > 1$, this probability is no
more than $P^{-2r'}$. Then, since there are P processors, the probability that any processor uses
more than $2erS_1 = O(rS_1)$ space at iteration t is bounded by $P^{-(2r'-1)}$.
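
The arithmetic in this application of Lemma 13 can be checked directly. The sketch below is illustrative, with hypothetical function names: it evaluates the right-hand side of Lemma 13 at $\delta = e - 1$ and mean $2r$, and then applies the union bound over P processors.

```python
from math import e, exp, log


def raghavan_bound(delta, mean):
    """Right-hand side of Lemma 13: (e^delta / (1+delta)^(1+delta))^mean."""
    return (exp(delta) / (1 + delta) ** (1 + delta)) ** mean


def overflow_bound(P, r_prime):
    """Bound on the chance that some processor exceeds 2 e r S_1 space."""
    r = r_prime * log(P)                           # r = r' ln P
    per_processor = raghavan_bound(e - 1, 2 * r)   # equals e^(-2r) = P^(-2r')
    return P * per_processor                       # union bound: P^-(2r'-1)


print(overflow_bound(16, 2), 16 ** -3)
```

With $\delta = e - 1$ the base of the bound collapses to $e^{-1}$, which is why the per-processor probability is exactly $e^{-2r}$.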

This probabilistic bound shows that with an appropriate choice of synchronization parameter,
we can allow each processor $O(rS_1)$ space and ensure that no processor ever exceeds this
space allotment by simply rerandomizing the thread locations any time a processor fills up its
allotted space. As we just proved, the probability that rerandomizing is needed at any given iteration
is no more than $P^{-(2r'-1)}$. Therefore, the expected number of times that rerandomizing
is needed over the course of the entire execution is no more than $T_1/P^{2r'-1}$. If rerandomizing
can be accomplished in $O(rS_1)$ time (as is the case with a fully-connected, hypercube, or
indirect butterfly network), then the total expected time taken by rerandomization is no more
than

\[ \frac{T_1}{P^{2r'-1}} \cdot O(rS_1) = O\left( \frac{T_1\, rS_1}{P^{2r'-1}} \right) . \]

This total expected rerandomization time is $O(T_1/P)$ provided $r' = \Omega(1 + \log_P S_1)$. Thus, by
setting the synchronization parameter to $r = \Theta(\lg P + \lg S_1)$ and rerandomizing thread locations
when any processor fills its space allotment, the LDF algorithm achieves linear expected
speedup (provided the computation has lg P slack in its average available parallelism) with
each processor’s storage requirement bounded by $O(S_1(\lg P + \lg S_1))$. When $S_1$ is bounded by
a polynomial in P, this space bound is $O(S_1 \lg P)$.


Simple strategies that don’t work 


To conclude this chapter, we now show that some simpler ways of adopting the Karp and Zhang 
technique do not work. 

The most natural thing to try is to have each processor work depth-first out of its local 
queue and spawn new threads to random locations. Specifically, when a processor executes a 
task from a thread, if it spawns a new thread, the original thread is kept locally and the new 
child thread is moved to the queue of a processor selected uniformly at random from all P 
processors. With this method, once a thread gets spawned and placed into a random queue, 
it never has to migrate. Unfortunately, this method does not work as the following scenario 
illustrates. Suppose a processor p has as its deepest thread a thread that just keeps spawning 
children — a loop with many iterations for example — and each child thread has a unit size 
activation frame. Suppose also that this loop thread is at activation depth d and all the other 
processors are busy executing long threads at activation depths greater than d+ 1. In this 
case, most of the invocations (which have depth d+ 1) spawned by the loop thread land in the 
queues of the other P — 1 processors and languish there. The occasional invocation that lands 
at processor p temporarily interrupts the loop thread, but if each loop invocation is just a short 
thread, processor p quickly resumes executing the loop thread. Thus, the loop invocations just 
keep piling up and eventually overflow memory. 

To fix this problem, we must force threads to migrate. After a processor executes a task 
from a thread, it moves that thread to the local queue of a processor selected uniformly at 
random, and as before, any newly spawned threads are placed at random. Unfortunately, even 


this method does not work as the following scenario illustrates. Consider an activation depth d
and suppose the threads at d are long sequential threads with unit size activation frames and
the threads at depth d − 1 just keep spawning these long threads. At each step, if a processor
has a depth d thread, it just executes a task from that thread and then moves that thread to
a random processor; otherwise, it executes a depth d − 1 thread which spawns a child at depth
d. Therefore, if we look at the queues at depth d as bins and the threads as balls, we have the
following process. At each step, one ball is removed from each non-empty bin and P new balls
are thrown at random into the P bins. If we consider this process over n steps and consider the
balls arriving in a particular bin, we have a binomial distribution with mean n and standard
deviation $\Theta(\sqrt{n})$. Thus, we can show that the expected number of balls that arrive into the
fullest bin is $n + \Omega(\sqrt{n})$. During this time, at most n balls are removed from this bin, hence,
this bin contains $\Omega(\sqrt{n})$ balls at the end of these n steps. Recall that each ball corresponds
to an activation frame, and therefore, this probabilistic analysis shows that over time, some 
queue’s size grows as the square root of the elapsed time. 
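
The balls-and-bins process in this scenario is simple to simulate. The sketch below is a rough illustration with a hypothetical function name and arbitrary parameters (P = 8, a fixed seed); it applies exactly the stated rule, removing one ball from each non-empty bin and then throwing P new balls at random, and prints the fullest bin after increasingly long runs so that its growth can be compared with the square root of n.

```python
import random


def simulate(P, n, rng):
    """Run n steps of the process; return (final bin loads, balls removed)."""
    loads = [0] * P
    removed = 0
    for _step in range(n):
        for i in range(P):               # one ball leaves each non-empty bin
            if loads[i] > 0:
                loads[i] -= 1
                removed += 1
        for _ball in range(P):           # P new balls land in random bins
            loads[rng.randrange(P)] += 1
    return loads, removed


rng = random.Random(3)                   # fixed seed, for reproducibility only
for n in (100, 400, 1600, 6400):
    loads, _ = simulate(8, n, rng)
    print(n, max(loads), round(n ** 0.5))
```

A single run is noisy, so the printed values should be read as indicative of the trend only, not as a measurement of the constant in the bound.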

The basic problem in the above scenario is that with a purely random process without any 
global control, over time, processors get out of synch with each other. Even though there may 
be lots of deep threads in the system, every once in a while, some processor will be without 
any of these deep threads and therefore will execute a task from a shallow thread that spawns 
a child. Thus, over time, the number of threads in the system can just keep growing. Our 
solution to this problem uses a moderate degree of global control to throttle the execution of 
processors that get out of synch. We implement this throttle by maintaining a cutoff depth 
to ensure that all processors only execute threads that are among the deepest in the system; 
a processor that does not have any of these deepest threads cannot execute any tasks until it
gets some of these deepest threads.

Chapter 7


Scheduling nonstrict, depth-first multithreaded 
computations 


The algorithms of Chapters 5 and 6 for strict multithreaded computations can also be used for
nonstrict, depth-first computations: just change the computation to make it strict and then
execute the strict computation. Transforming a computation to make it strict involves simply
adding data dependency edges as illustrated in Figures 5.1 and 5.2; we call this transformation
strictifying the computation. This transformation is always valid for depth-first computations.
For arbitrary computations, however, there are examples for which strictifying adds data dependency
edges that introduce cycles into the computation; for such computations, nonstrict
spawns are required in any valid execution schedule.

Consider an arbitrary depth-first multithreaded computation with work $T_1$, computation
depth $T_\infty$, and activation depth $A = S_1$. Strictifying this computation produces a new computation
with the same work and activation depth but with a possibly larger computation depth
that we denote $T_\infty^{s}$. Executing the strict computation on a P-processor computer with algorithm
GDF generates an execution schedule $\mathcal{X}$ with $S_P(\mathcal{X}) \le S_1 P$ and $T_P(\mathcal{X}) \le T_1/P + T_\infty^{s}$.
Such a schedule achieves linear speedup provided the strict computation has sufficient average
available parallelism; that is, provided $T_1/T_\infty^{s} = \Omega(P)$. In general, any of the algorithms
of Chapters 5 and 6 achieves linear speedup (or linear expected speedup) provided the strict
computation has average available parallelism sufficiently large relative to P (or P lg P). When
$T_\infty^{s}$ is much larger than $T_\infty$, however, the strict computation may not have sufficient average
available parallelism even though the original (nonstrict) computation does. The fact that a
nonstrict computation may have far more parallelism than its strict counterpart is one of the
reasons for nonstrictness. Hence, we would like a technique by which a scheduler can exploit
at least some of the parallelism offered by nonstrict spawns.

The lower bound of Theorem 4, however, should temper our expectations. The computations 
demonstrated in the proof of Theorem 4 are all depth-first, but they use extreme amounts of 
nonstrictness in order to achieve parallelism. As the theorem shows, exploiting this nonstrict 
parallelism requires potentially unmanageable amounts of storage. Thus, we cannot hope for 
a technique that achieves parallel speedup from arbitrary uses of nonstrict spawns while still 
maintaining reasonable space bounds. 

In this chapter, we exhibit a technique that allows a scheduler to exploit some of the par- 
allelism available through nonstrict spawns. This technique allows the scheduler to perform 
some nonstrict spawns while still maintaining space bounds that are within a constant factor 
of the bounds it obtains for strict computations. Of course, this technique cannot guarantee 
any speedup from the nonstrict spawns, but it does guarantee execution time that is no greater 
than the execution time obtained by strictifying and executing the strict computation. 

It is important to realize that when space is bounded, the use of nonstrict spawns when 
executing a computation can actually result in an execution time that is longer than the exe- 
cution time that results from simply executing the strictified computation. Suppose we could 
execute the computation as if it were strictified, but at each step, if there is an idle processor 
and a thread that is stalled (due to the strictness condition) at a task that wants to spawn, we 
let the processor go ahead and execute that task thereby performing a nonstrict spawn. For 
example, in executing the computation of Figure 5.2(a), if at some time step t, execution of
the parent thread is at the task u that spawns the invocation (f a b) and execution of either
the child thread evaluating expression A or the child thread evaluating expression B is not
complete, then task u can only execute if a processor would otherwise go idle. Performing the
spawn requires allocating an activation frame, and this is where the trouble lies as the following 
scenario illustrates: Suppose there is a single thread , computing a value A that is used by 
lots of other threads. At step t, one processor executes a task from , , and instead of idling, 


some of the other processors perform nonstrict spawns — invoking functions that have A as 
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an argument, for example. At the next step, the same thing happens, and this continues for 
awhile. Over time, memory gets filled up with the activation frames of these threads that were 
spawned nonstrictly. To avoid overflowing memory bounds, eventually these nonstrict spawns 
must cease. At this point, thread , is still computing A, and lots of other threads are stalled 
waiting for A. Now, if, wants to spawn a bunch of child threads to help it compute A, it 
cannot do so since memory is already full. In this case, the nonstrict spawns do not really add 
any useful parallelism since the spawned threads just stall waiting for A. Useful parallelism 
could have come from the evaluation of A, but with memory full, that parallelism cannot be 
exploited. Thus, performing nonstrict spawns may increase processor utilization for a brief spell 
but at the cost of forcing very low processor utilization for a potentially very long period of 
time — a period of time that could have had very high processor utilization had those nonstrict 
spawns not been performed. 

To keep the nonstrict spawns from hindering the progress of other parts of the computation,
we classify each active thread as either strict or nonstrict and then ensure that the nonstrict
threads do not fill up too much memory. When a thread is spawned nonstrictly, we say that
the thread itself is nonstrict. A nonstrict thread remains nonstrict until the data dependencies
that caused the spawn to be nonstrict in the first place are resolved. Once those data
dependencies are resolved, the thread is strict. For example, in executing the computation
in Figure 5.2(a), if the child thread that evaluates the invocation (f a b) is spawned
nonstrictly, then that thread remains nonstrict until both the thread evaluating expression A and
the thread evaluating expression B terminate, thereby resolving the associated data
dependencies. A strictly spawned thread is considered strict and remains strict. Observe that from the
time an active thread Γ becomes strict until the time Γ terminates, there is always at least one
thread from the subtree rooted at Γ that is ready. This crucial property of strict threads, in
combination with an enforced bound on the space used by nonstrict threads, forms the basis for
a technique that we call α-sequestering.

To ensure that the activation frames of nonstrict threads do not interfere with the progress
of strict threads, the α-sequestering technique allocates separate space (the amount is
determined by the value of α) for use by the nonstrict threads. By maintaining a separate region of
memory for the activation frames of nonstrict threads, α-sequestering allows nonstrictness
without adversely affecting the running time. We execute the computation as if it were strictified,
but at each step, if there are idle processors and threads that are stalled (due to the strictness
condition) at tasks that want to spawn, we allow processors to perform these nonstrict spawns
so long as the activation frames of the resulting nonstrict threads do not overflow their region
of memory.

We illustrate the effectiveness of α-sequestering in conjunction with the global depth-first
algorithm GDF. Suppose we allow nonstrict spawns only so long as no activation depth ever
has more than $\alpha P$ active, nonstrict threads that span it. We are not specifying any particular
way of prioritizing among nonstrict spawns; we are only saying that nonstrict spawns can
occur only when processors would otherwise go idle, and only so long as no
activation depth ever has more than $\alpha P$ active, nonstrict threads that span it. For this reason,
we refer to this scheduling policy as the α-sequestered GDF method (rather than algorithm).
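The admission rule just described can be sketched as a small bookkeeping structure. This is only an illustration under stated assumptions, not part of the method as specified: the class name `NonstrictBudget`, its methods, and the modelling of a thread's frame as an inclusive range of integer activation depths are all hypothetical choices made here. How a real scheduler would track spanning threads is deliberately left open by the text.

```python
from dataclasses import dataclass, field


@dataclass
class NonstrictBudget:
    """Counts active nonstrict threads spanning each activation depth.

    Assumption (not from the text): a thread's activation frame occupies the
    inclusive integer depth range lo..hi, and it "spans" exactly those depths.
    """
    alpha: float
    num_procs: int
    spanning: dict = field(default_factory=dict)  # depth -> active nonstrict count

    def limit(self) -> float:
        # At most alpha * P active nonstrict threads may span any one depth.
        return self.alpha * self.num_procs

    def can_spawn(self, lo: int, hi: int) -> bool:
        # Admitting one more thread must not push any spanned depth over the limit.
        return all(self.spanning.get(d, 0) + 1 <= self.limit()
                   for d in range(lo, hi + 1))

    def spawn(self, lo: int, hi: int) -> bool:
        """Try to admit a nonstrict spawn; return False if it must be refused."""
        if not self.can_spawn(lo, hi):
            return False
        for d in range(lo, hi + 1):
            self.spanning[d] = self.spanning.get(d, 0) + 1
        return True

    def release(self, lo: int, hi: int) -> None:
        """Call when a nonstrict thread terminates or becomes strict."""
        for d in range(lo, hi + 1):
            self.spanning[d] -= 1
```

A scheduler built on this sketch would call `spawn` only when a processor would otherwise go idle, and would call `release` both when a nonstrict thread terminates and when its data dependencies resolve and it becomes strict.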


Theorem 14 For any number P of processors and any depth-first multithreaded computation
with work $T_1$, strict computation depth $T_\infty^{(s)}$, and activation depth $A = S_1$, the α-sequestered
GDF method computes a schedule $\mathcal{X}$ such that $T_P(\mathcal{X}) \le T_1/P + T_\infty^{(s)}$ and
$S_P(\mathcal{X}) \le (1+\alpha) S_1 P$.


Proof: The time bound follows from Theorem 1 since the schedule $\mathcal{X}$ is greedy with respect
to the strictified version of the computation.

To prove the space bound, we show that no activation depth ever has more than $(1+\alpha)P$
active threads that span it. Specifically, using the notation from the proof of Theorem 5, we
show that for every activation depth d and every time step t, the bound $s(t,d) \le (1+\alpha)P$
holds. The space bound then follows from Equation (5.1). As before, we prove this bound by
induction on the number of time steps, and again, the base case is obvious.

Now, consider a time step t that begins with $s(t,d) \le (1+\alpha)P$ active threads spanning d.
Further, let $s'(t,d)$ denote the number of these threads that are strict. With $s'(t,d)$ active,
strict threads spanning d, there must be at least $s'(t,d)$ ready threads at or deeper than d. We
consider two cases. In the first case, $s'(t,d) \ge P$. In this case, there are at least P ready threads
at or deeper than d; hence, no threads less deep than d execute at step t. Therefore, the number
of active threads that span d cannot increase during step t, so $s(t+1,d) \le s(t,d) \le (1+\alpha)P$.
In the other case, $s'(t,d) < P$, so as many as $P - s'(t,d)$ threads less deep than d may execute
during step t. Consequently, the number of threads that span d may increase by as many as
$P - s'(t,d)$ but not more. Thus,

\begin{align*}
s(t+1,d) &\le s(t,d) + (P - s'(t,d)) \\
         &=   P + (s(t,d) - s'(t,d)) \\
         &\le P + \alpha P
\end{align*}

since $s(t,d) - s'(t,d)$ is the number of active, nonstrict threads that span d, and this number,
by force of the method, is no more than $\alpha P$. In both cases, $s(t+1,d) \le (1+\alpha)P$, and the
induction is complete. □


Exactly as with GDF, we can use the α-sequestering technique with algorithm GDF’ to
yield the α-sequestered GDF’ method.


Theorem 15 For any number P of processors and any depth-first multithreaded computation
with work $T_1$, strict computation depth $T_\infty^{(s)}$, and activation depth $A = S_1$, the α-sequestered
GDF’ method computes a schedule $\mathcal{X}$ such that $T_P(\mathcal{X}) \le T_1/P + T_\infty^{(s)}$ and
$S_P(\mathcal{X}) \le (1+\alpha) S_1 P$.


Proof: This proof follows the proof of Theorem 14, but we add the following assertion to the
induction hypothesis: For any activation depth d and time step t, if t begins with $s'(t,d)$ active,
strict threads that span d, then t also begins with no more than $\max(P - s'(t,d), 0)$ processors
having a thread with activation depth less than d. Proving that this additional assertion holds
follows the proof of Theorem 6. □


This α-sequestering technique can also be used with the local depth-first algorithm LDF.
At each iteration, only those threads (strict or nonstrict) at or deeper than the cutoff depth
can execute, and each processor is allowed no more than r spawns (strict or nonstrict), where
r is the synchronization parameter. Nonstrict spawns are allowed only when a processor would
otherwise go idle and only so long as no activation depth ever has more than $\alpha P$ active,
nonstrict threads that span it. This α-sequestered LDF method achieves execution time as
stated in Theorem 11 but with $T_\infty$ replaced by $T_\infty^{(s)}$; this result follows by making the obvious
change in the proof of Lemma 9. The space bound is captured in the following theorem.


Theorem 16 For any number P of processors and any depth-first multithreaded computation
with activation depth $A = S_1$, the α-sequestered LDF(r) method computes a schedule $\mathcal{X}$ such
that $S_P(\mathcal{X}) \le (2r+\alpha) S_1 P$.


Proof: We show that for any activation depth d and any iteration t, the bound
$s(t,d) \le (2r+\alpha)P$ holds. Again, we prove this bound by induction on the number of iterations, and the
base case is obvious.

Now, consider an iteration t that begins with $s(t,d) \le (2r+\alpha)P$ active threads that span d.
As before, let $s'(t,d)$ denote the number of active, strict threads that span d at the start
of iteration t. Consider two cases. In the first case, $s'(t,d) \ge rP$. In this case there are at least
rP ready threads at or deeper than d, and by pigeonholing, some processor must have at least
r of them. Therefore, the synchronization phase sets the cutoff depth D with $D \ge d$; hence, no
thread less deep than d executes at iteration t. Consequently, $s(t+1,d) \le s(t,d) \le (2r+\alpha)P$.
In the other case, $s'(t,d) < rP$. In this case, the number of active threads that span d may
increase but not by more than rP since no processor may execute more than r spawns during
an iteration. Then

\begin{align*}
s(t+1,d) &\le s(t,d) + rP \\
         &=   (s(t,d) - s'(t,d)) + (s'(t,d) + rP) \\
         &<   (s(t,d) - s'(t,d)) + 2rP \\
         &\le \alpha P + 2rP
\end{align*}

since $s(t,d) - s'(t,d)$ is the number of active, nonstrict threads that span d, and this number,
by force of the method, is no more than $\alpha P$. In both cases, $s(t+1,d) \le (2r+\alpha)P$, and the
induction is complete. □


By adjusting the value of α, the α-sequestering technique provides some control over the
space bounds and the allowable nonstrictness. With α = 0, the computation is forced to execute
strictly. At the other extreme, with α = ∞, the computation may execute with arbitrary
amounts of nonstrictness (and achieve execution time within a factor of two of optimal by using
a greedy schedule) but with a potentially huge demand on space. In order to maintain space
bounds that are within a constant factor of those obtained with strict computations, the value α
needs to be no more than a constant (for GDF or GDF’) or proportional to the synchronization
parameter (for LDF).
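To make the trade-off concrete, the bounds of Theorems 14 through 16 can be evaluated directly. The helper names below are hypothetical and the numbers are illustrative only. The functions compute the upper bounds stated in the theorems, not the space a particular execution would actually use.

```python
def gdf_space_bound(alpha: float, s1: float, p: int) -> float:
    """Upper bound (1 + alpha) * S_1 * P from Theorems 14 and 15."""
    return (1 + alpha) * s1 * p


def ldf_space_bound(alpha: float, r: int, s1: float, p: int) -> float:
    """Upper bound (2r + alpha) * S_1 * P from Theorem 16."""
    return (2 * r + alpha) * s1 * p


if __name__ == "__main__":
    s1, p = 10, 4  # illustrative serial space and processor count
    for alpha in (0, 1, 2):
        print(alpha, gdf_space_bound(alpha, s1, p), ldf_space_bound(alpha, 2, s1, p))
```

With α = 0 the GDF bound reduces to $S_1 P$, and choosing α equal to r for LDF raises the bound from $2r S_1 P$ to $3r S_1 P$, a constant factor of 3/2.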

The α-sequestering technique does not specify how to schedule nonstrict spawns, it does
not specify how to determine whether a particular spawn will be nonstrict, and it does not
specify how to keep track of the space being used by nonstrict threads. All of these further
specifications are needed for a real algorithm or implementation. Furthermore, α-sequestering
does not guarantee any speedup from the nonstrict parallelism. Nevertheless, with proper
linguistic and runtime mechanisms, α-sequestering may prove feasible, and with new ways to
prioritize the nonstrict spawns, α-sequestering may be able to exploit nonstrict parallelism
with small values of α and provable speedup for specific uses of nonstrictness in depth-first
computations.
Chapter 8


Related work 


Storage management for multithreaded computations has been a concern for a number of years.
In 1985, Halstead [12] described this problem:


A classical difficulty for concurrent architectures occurs when there is too much 
parallelism in the program being executed. A program that unfolds into a very large 
number of parallel tasks may reach a deadlocked state where every task, to make 
progress, requires additional storage (e.g., to make yet more tasks), and no more 
storage is available. This can happen even though a sequential version of the same 
program requires very little storage. In effect, the sequential version executes the 
tasks one after another, allowing the same storage pool to be reused. By trying to 
execute all tasks at the same time, the parallel machine may run out of storage. 


Nevertheless, precious little prior work has addressed this problem. To date, most existing 
techniques for controlling storage requirements have consisted of heuristics to either bound 
storage use by explicitly controlling storage as a resource or reduce storage use by modifying 
the scheduler’s behavior. We are aware of no prior scheduling algorithms with proven time and 
space bounds. 

The storage management problem, as described by Halstead, can be quite pronounced under
the execution of a fair scheduler. By executing threads in round-robin fashion, a fair scheduler
gives each ready thread a fair portion of the execution time. A fair scheduler aggressively
exposes parallelism, often resulting in excessive space requirements. Consider the multithreaded
computation of Figure 8.1. Let N denote the number of leaf threads (this computation performs
a divide-and-conquer algorithm on an input of size N), and suppose each activation frame has
unit size. This computation has work $T_1 = O(N)$ and activation depth $A = O(\lg N)$. Notice also
that this computation is depth-first (and strict), and therefore it can be sequentially executed
using space $S_1 = A = O(\lg N)$. A parallel execution with a fair scheduler, however, executes
this computation in (nearly) breadth-first order; at some point in the execution, nearly every
leaf thread is active, and therefore, the fair schedule $\mathcal{X}$ (with any number $P \ge 2$ of processors)
uses space $S_P(\mathcal{X}) = O(N)$, an exponential blowup in storage requirements.

Figure 8.1: A multithreaded computation to perform a divide-and-conquer algorithm. Each
non-leaf thread spawns two children. Each child computes a value that it passes back to its
parent. Once the parent gets a value back from each child, it computes a result value that it
then passes up to its parent.
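The gap between the two execution orders on the divide-and-conquer computation of Figure 8.1 can be checked with a short count of live activation frames. This is an idealized sketch rather than a model of any real scheduler: it assumes unit-size frames, an input that splits in half at every non-leaf thread, and a breadth-first execution that spawns every thread before any leaf terminates. The function names are hypothetical.

```python
def serial_peak_frames(n_leaves: int) -> int:
    """Peak number of live frames in a depth-first serial execution."""
    def visit(size: int, live: int) -> int:
        live += 1  # allocate this thread's frame
        if size == 1:
            return live  # leaf: the stack is as deep as it gets on this path
        left = visit(size // 2, live)
        right = visit(size - size // 2, live)  # the left subtree's frames were freed
        return max(left, right)
    return visit(n_leaves, 0)


def breadth_first_peak_frames(n_leaves: int) -> int:
    """Live frames once a level-by-level execution has spawned every thread.

    A parent's frame stays allocated until both children return, so when the
    last level is spawned every thread in the tree is live at once.
    """
    live = 0
    level = [n_leaves]
    while level:
        live += len(level)
        level = [half
                 for size in level if size > 1
                 for half in (size // 2, size - size // 2)]
    return live
```

For N = 1024 leaves the serial count is 11 frames (lg N + 1), while the breadth-first count is 2047 frames (2N − 1).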

In order to curb the excessive exposure of parallelism, and consequent excessive use of
space, exhibited by fair scheduling, researchers from the dataflow community have developed
heuristics to explicitly manage storage as a resource. The effectiveness of these heuristics is
documented with encouraging empirical evidence but no provable time bounds. We consider
two of these heuristic techniques: bounded loops and the coarse-grain throttle.

Culler’s bounded loops technique [6, 7, 8] uses compile-time analysis to augment the program
code with resource management code. For each loop of the program, the resource management
code computes a value called the k-bound; a k-bounded loop can have at most k iterations
simultaneously active. The k-bound represents k tickets, each of which buys the use of some
storage. Once the loop has spawned k iterations, it must wait until one of those iterations
completes and relinquishes its ticket; then the loop can use that ticket to spawn another
iteration. The compile-time analysis that generates the code that computes the k-bounds is
based on heuristics developed from a systematic study of loops in scientific dataflow programs
(programs employing only iteration and primitive recursion) [7]. These heuristics attempt to
set the k-bounds so that the exposed parallelism is maximized under the constraint that space
usage stays within the machine’s capacity.
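The ticket discipline of a k-bounded loop resembles a counting semaphore. The sketch below is only an analogy written with Python threads. It is not Culler's compiler-generated dataflow code, and the function `run_k_bounded` and its parameters are hypothetical names introduced for illustration.

```python
import threading


def run_k_bounded(num_iterations: int, k: int, body) -> int:
    """Run body(i) for each i with at most k iterations active at once.

    Returns the peak number of simultaneously active iterations observed.
    """
    tickets = threading.Semaphore(k)  # the k tickets of the k-bound
    lock = threading.Lock()
    active = 0
    peak = 0

    def iteration(i: int) -> None:
        nonlocal active, peak
        try:
            with lock:
                active += 1
                peak = max(peak, active)
            body(i)
        finally:
            with lock:
                active -= 1
            tickets.release()  # relinquish the ticket after leaving the active set

    threads = []
    for i in range(num_iterations):
        tickets.acquire()  # the loop waits here once k iterations are outstanding
        t = threading.Thread(target=iteration, args=(i,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return peak
```

Because a ticket is acquired before an iteration starts and released only after the iteration has left the active set, the observed peak never exceeds k regardless of how the threads interleave.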

Ruggiero’s coarse-grain throttle technique [30] makes storage allocation decisions based on
overall machine activity at run-time. When a process (thread) wants to spawn a child, it 
must request an activation name from the resource management system. When the overall 
level of activity in the machine is high, the resource manager defers these requests, thereby 
suspending the requesting processes. When the activity level falls below a certain threshold, 
the resource manager begins granting deferred requests giving priority to the lowest, leftmost 
suspended processes in the process (activation) tree. Like the bounded loops technique, the 
goal of the coarse-grain throttle is to maximize the exposed parallelism under a fixed space 
usage constraint. 
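The deferral policy can be pictured as a priority queue of suspended spawn requests. The following is a loose sketch under simplifying assumptions that are not taken from [30]: machine activity is modelled as a simple count of granted activations, a single threshold governs both deferral and granting, and "lowest, leftmost" is encoded as greatest depth first with ties broken by smallest left-to-right position. The class `Throttle` and its methods are hypothetical.

```python
import heapq


class Throttle:
    """Defers spawn requests while activity is at or above a threshold."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.active = 0
        self._deferred = []  # min-heap of (-depth, position, name)

    def request(self, name: str, depth: int, position: int) -> bool:
        """Ask for an activation; True if granted now, False if deferred."""
        if self.active < self.threshold:
            self.active += 1
            return True
        heapq.heappush(self._deferred, (-depth, position, name))
        return False

    def finish(self) -> list:
        """Record that one activation ended; return the names granted as a result."""
        self.active -= 1
        granted = []
        while self._deferred and self.active < self.threshold:
            _, _, name = heapq.heappop(self._deferred)  # deepest, then leftmost
            self.active += 1
            granted.append(name)
        return granted
```

Favoring the deepest suspended request steers execution toward the order a serial depth-first execution would follow.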

In contrast with these heuristic techniques, we have chosen to develop an algorithmic foun- 
dation that manages storage by allowing programmers to leverage their knowledge of storage 
requirements for sequentially executed programs. The two techniques just described view stor- 
age as a resource that requires explicit management, and they actually modify execution be- 
havior based on these management policies. Such techniques, however, generally have not been 
needed for programs running on serial machines — when the machine runs out of memory, the 
program terminates. On most uniprocessor systems, the job of ensuring that the program does 
not use too much memory rests solely with the programmer, and such systems work because 
programmers understand the storage model and they understand the execution schedule that 
orders the invocations of the program’s procedures. On parallel systems, however, the storage 
model is somewhat more complex and predicting the execution order is somewhat more diffi- 
cult. Nevertheless, this increased complexity does not require encumbering parallel machines 
with responsibility for bounding storage requirements. Programmers should still be able to un- 
derstand the storage model, and by developing an algorithmic understanding of scheduling that 
relates parallel storage requirements to serial storage requirements, programmers should still 
be able to predict how much storage their programs will use when run on a parallel computer. 

Other researchers have also addressed the storage issue by attempting to relate parallel
storage requirements to serial storage requirements. Halstead, in completing the quoted paragraph
above, made the following observation:


Ideally, parallel tasks should be created until the processing power of the parallel 
machine is fully utilized (we may call this saturation) and then execution within 
each task should become sequential. [12] 


To emulate this ideal behavior, Halstead considered an unfair scheduling policy. When a 
processor executes a thread that spawns a child, the processor places the parent thread into a 
local LIFO pending queue and begins work on the child thread. If all the processors remain 
busy, the parent thread stays in the local pending queue until the child thread terminates. 
(This execution is exactly the type of depth-first sequential execution that is so familiar to 
programmers.) If, however, another processor goes idle in the meantime, then it may steal the 
pending parent thread. Thus, so long as all the processors remain busy, each processor operates 
depth-first out of its local queue and each local queue’s size is bounded by the maximum stack 
depth in a serial execution. On the strict computation of Figure 8.1, for example, this unfair
scheduling policy computes a P-processor execution schedule $\mathcal{X}$ with $S_P(\mathcal{X}) \le S_1 P$. When we
consider more complex computations, however, even if we restrict attention to strict computations,
this unfair scheduling policy may exhibit greater than linear space expansion, and in general,
predicting or bounding space usage is quite difficult.
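The queue discipline of this unfair policy can be sketched as follows. This is a hedged illustration only. The class `Processor` and its methods are hypothetical, and the choice of which pending thread a thief takes (here, the oldest one) is an assumption, since the description above does not fix it.

```python
from collections import deque


class Processor:
    """One processor with a local LIFO pending queue."""

    def __init__(self):
        self.pending = deque()  # right end is the top of the LIFO queue
        self.current = None     # thread currently being executed, if any

    def spawn(self, child) -> None:
        """Suspend the running parent locally and begin work on the child."""
        self.pending.append(self.current)
        self.current = child

    def finish_current(self):
        """The current thread terminates; resume the most recent pending parent."""
        self.current = self.pending.pop() if self.pending else None
        return self.current

    def steal_from(self, victim: "Processor") -> bool:
        """If idle, take one pending thread from the victim; True on success."""
        if self.current is not None or not victim.pending:
            return False
        # Assumption: the thief takes the oldest pending thread.
        self.current = victim.pending.popleft()
        return True
```

As long as no steal occurs, `pending` behaves exactly like the call stack of a serial depth-first execution, which is why each local queue stays bounded by the serial stack depth while all processors remain busy.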

Characterizing the performance of Halstead’s unfair scheduling policy is even more difficult 
when we consider time bounds. Though this policy attempts to compute a greedy schedule 
by allowing idle processors to steal pending threads from other processors, success depends on 
the thread stealing algorithm. Other researchers [17, 23, 34] have considered variants of unfair 
scheduling, but none have fully developed or analyzed thread stealing algorithms. 

A multithreaded computation with no data dependency edges is equivalent to a backtrack
search problem, and in this context, Zhang [36] actually did develop and analyze a thread
stealing algorithm. Zhang showed that in a fully connected processor model with P processors,
if idle processors choose other processors at random to steal work from, then a binary tree of
size N and height h can be searched in $O(N/P + h)$ time with high probability. In the context
of multithreaded computations with no data dependency edges, this bound translates into a
schedule $\mathcal{X}$ that with high probability achieves $T_P(\mathcal{X}) = O(T_1/P + T_\infty)$. Though Zhang did not
make the observation, his algorithm also demonstrates linear expansion of space: $S_P(\mathcal{X}) \le S_1 P$.
Other researchers [18, 29] have considered backtrack search on fixed-connection networks, but
their algorithms explore the tree in breadth-first order and consequently demonstrate poor
space performance.
Chapter 9


Conclusions 


The results of this thesis just begin to develop our algorithmic understanding of nonstrictness 
in multithreaded computations. We have formalized a model of multithreaded computations 
and developed a working definition to characterize efficient execution schedules with respect 
to time and space usage. In general, it appears that arbitrary uses of nonstrictness can make 
efficient parallel execution difficult. In fact, we have demonstrated uses of nonstrictness that 
make efficient parallel execution provably impossible. This difficulty stands in sharp contrast to 
the situation with strict computations. For strict computations, we have shown the existence 
of efficient execution schedules for any number of processors, and further, we have exhibited 
(fairly) efficient online and distributed algorithms to compute such schedules. Between these 
extremes, we have a technique that allows the use of some nonstrictness in an otherwise strict 
computation without degrading the efficiency, but this technique does not guarantee any benefit 
from the nonstrictness. 

Even among the strict computations, some open problems still remain, most notably with 
respect to efficient and practical scheduling algorithms. For one thing, none of the algorithms 
presented in this thesis deal with the space used by persistent data structures. Also, the 
LDF algorithm of Chapter 6 does not take any advantage of locality. An algorithm that 
can keep groups of closely related threads in the same processor or that can exploit specific 
fixed-connection networks to keep related threads close to one another would alleviate some 
of the communication costs. The work on lazy task creation [23] and the work on dynamic 
tree embedding [3, 22] may provide some pointers in this direction. Of course, an algorithm
that removes the lg P factor from the space bound of LDF would be a nice improvement.
Other algorithmic improvements to LDF might include: an algorithm that performs less thread
migration, a technique to keep track of thread location when threads do migrate, a more 
asynchronous algorithm, and an incremental rebalancing technique to keep the individual queues 
bounded. Finally, it would be interesting to see if a deterministic distributed algorithm is 
possible. 

Turning back to nonstrict computations, we find a vast range of uncharted territory.
Currently, α-sequestering is the only technique we know of that allows nonstrictness in the execution
of multithreaded computations while maintaining reasonable space and time bounds. This
technique may be practical if efficient support mechanisms can be developed. In this case, with
simple algorithms for scheduling the nonstrict spawns, the α-sequestered methods described in
Chapter 7 may perform well in practice using small values of α.

We believe, however, that deriving any real benefit from either α-sequestering or any other
technique for executing nonstrict computations depends on developing a fundamental under- 
standing of how nonstrictness can be used to realize increased parallelism. Computations that 
are inherently highly parallel can be packaged into programs in such a way that the parallelism 
can only be exploited through such extensive use of nonstrictness that efficient execution on 
a parallel computer is impossible. Therefore, we need to understand how to write programs 
in such a way that nonstrict parallelism can be exploited. Developing such an understanding 
might involve identifying useful patterns of usage for nonstrictness and developing algorithms 
to schedule computations that follow these patterns. Such advances would greatly increase the
utility of nonstrictness and in general would expand the class of multithreaded computations
for which efficient methods of execution are known.